=============================
User Guide for AMDGPU Backend
=============================
.. toctree::

   AMDGPU/AMDGPUAsmGFX900
   AMDGPU/AMDGPUAsmGFX904
   AMDGPU/AMDGPUAsmGFX906
   AMDGPU/AMDGPUAsmGFX908
   AMDGPU/AMDGPUAsmGFX90a
   AMDGPU/AMDGPUAsmGFX940
   AMDGPU/AMDGPUAsmGFX1011
   AMDGPU/AMDGPUAsmGFX1013
   AMDGPU/AMDGPUAsmGFX1030
   AMDGPUInstructionSyntax
   AMDGPUInstructionNotation
   AMDGPUDwarfExtensionsForHeterogeneousDebugging
   AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack

The AMDGPU backend provides ISA code generation for AMD GPUs, from the
R600 family through the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.

.. _amdgpu-target-triples:

Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
to specify the target triple:

.. table:: AMDGPU Architectures
   :name: amdgpu-architecture-table

   ============ ==============================================================
   Architecture Description
   ============ ==============================================================
   ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
   ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
   ============ ==============================================================

.. table:: AMDGPU Vendors
   :name: amdgpu-vendor-table

   ============ ==============================================================
   Vendor       Description
   ============ ==============================================================
   ``amd``      Can be used for all AMD GPU usage.
   ``mesa3d``   Can be used if the OS is ``mesa3d``.
   ============ ==============================================================

70 .. table:: AMDGPU Operating Systems
73 ============== ============================================================
75 ============== ============================================================
76 *<empty>* Defaults to the *unknown* OS.
77 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
80 - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
81 loader on Linux. See *AMD ROCm Platform Release Notes*
82 [AMD-ROCm-Release-Notes]_ for supported hardware and
84 - AMD's PAL runtime using the *pal-amdhsa* loader on
87 ``amdpal`` Graphic shaders and compute kernels executed on AMD's PAL
88 runtime using the *pal-amdpal* loader on Windows and Linux
90 ``mesa3d`` Graphic shaders and compute kernels executed on AMD's Mesa
91 3D runtime using the *mesa-mesa3d* loader on Linux.
92 ============== ============================================================
94 .. table:: AMDGPU Environments
95 :name: amdgpu-environment-table
97 ============ ==============================================================
98 Environment Description
99 ============ ==============================================================
101 ============ ==============================================================
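
The triple components listed in the tables above can be combined
mechanically. The following is a minimal Python sketch, not part of LLVM or
Clang: the helper name ``make_triple`` is hypothetical, the empty OS is not
modelled, and the environment component is treated as an optional free-form
string.

```python
# Allowed values copied from the architecture, vendor and OS tables.
ARCHITECTURES = {"r600", "amdgcn"}
VENDORS = {"amd", "mesa3d"}
OPERATING_SYSTEMS = {"amdhsa", "amdpal", "mesa3d"}


def make_triple(arch, vendor, os_name, environment=""):
    """Build an <Architecture>-<Vendor>-<OS>[-<Environment>] string.

    Hypothetical helper for illustration only.
    """
    if arch not in ARCHITECTURES:
        raise ValueError(f"unknown architecture: {arch}")
    if vendor not in VENDORS:
        raise ValueError(f"unknown vendor: {vendor}")
    if os_name not in OPERATING_SYSTEMS:
        raise ValueError(f"unknown OS: {os_name}")
    # The mesa3d vendor can be used only if the OS is mesa3d.
    if vendor == "mesa3d" and os_name != "mesa3d":
        raise ValueError("vendor mesa3d requires OS mesa3d")
    # amdhsa is not supported on the r600 architecture.
    if arch == "r600" and os_name == "amdhsa":
        raise ValueError("amdhsa is not supported on r600")
    parts = [arch, vendor, os_name]
    if environment:
        parts.append(environment)
    return "-".join(parts)
```

For example, ``make_triple("amdgcn", "amd", "amdhsa")`` yields the string
``amdgcn-amd-amdhsa``, which can then be passed to ``-target``.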

.. _amdgpu-processors:

Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
specify the AMDGPU processor together with optional target features. See
:ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU
target-specific information.

Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following exceptions:

* ``amdhsa`` is not supported on the ``r600`` architecture (see :ref:`amdgpu-architecture-table`).

118 .. table:: AMDGPU Processors
119 :name: amdgpu-processor-table
121 =========== =============== ============ ===== ================= =============== =============== ======================
122 Processor Alternative Target dGPU/ Target Target OS Support Example
123 Processor Triple APU Features Properties *(see* Products
124 Architecture Supported `amdgpu-os`_
133 =========== =============== ============ ===== ================= =============== =============== ======================
134 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
135 -----------------------------------------------------------------------------------------------------------------------
136 ``r600`` ``r600`` dGPU - Does not
141 ``r630`` ``r600`` dGPU - Does not
146 ``rs880`` ``r600`` dGPU - Does not
151 ``rv670`` ``r600`` dGPU - Does not
156 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
157 -----------------------------------------------------------------------------------------------------------------------
158 ``rv710`` ``r600`` dGPU - Does not
163 ``rv730`` ``r600`` dGPU - Does not
168 ``rv770`` ``r600`` dGPU - Does not
173 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
174 -----------------------------------------------------------------------------------------------------------------------
175 ``cedar`` ``r600`` dGPU - Does not
180 ``cypress`` ``r600`` dGPU - Does not
185 ``juniper`` ``r600`` dGPU - Does not
190 ``redwood`` ``r600`` dGPU - Does not
195 ``sumo`` ``r600`` dGPU - Does not
200 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
201 -----------------------------------------------------------------------------------------------------------------------
202 ``barts`` ``r600`` dGPU - Does not
207 ``caicos`` ``r600`` dGPU - Does not
212 ``cayman`` ``r600`` dGPU - Does not
217 ``turks`` ``r600`` dGPU - Does not
222 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
223 -----------------------------------------------------------------------------------------------------------------------
224 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
229 ``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
234 ``gfx602`` - ``hainan`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
239 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
240 -----------------------------------------------------------------------------------------------------------------------
241 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - Offset - *rocm-amdhsa* - A6-7000
242 flat - *pal-amdhsa* - A6 Pro-7050B
243 scratch - *pal-amdpal* - A8-7100
251 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - FirePro W8100
252 flat - *pal-amdhsa* - FirePro W9100
253 scratch - *pal-amdpal* - FirePro S9150
255 ``gfx702`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon R9 290
256 flat - *pal-amdhsa* - Radeon R9 290x
257 scratch - *pal-amdpal* - Radeon R390
259 ``gfx703`` - ``kabini`` ``amdgcn`` APU - Offset - *pal-amdhsa* - E1-2100
260 - ``mullins`` flat - *pal-amdpal* - E1-2200
268 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Offset - *pal-amdhsa* - Radeon HD 7790
269 flat - *pal-amdpal* - Radeon HD 8770
272 ``gfx705`` ``amdgcn`` APU - Offset - *pal-amdhsa* *TBA*
279 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
280 -----------------------------------------------------------------------------------------------------------------------
281 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - Offset - *rocm-amdhsa* - A6-8500P
282 flat - *pal-amdhsa* - Pro A6-8500B
283 scratch - *pal-amdpal* - A8-8600P
299 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon R9 285
300 - ``tonga`` flat - *pal-amdhsa* - Radeon R9 380
301 scratch - *pal-amdpal* - Radeon R9 385
302 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - *rocm-amdhsa* - Radeon R9 Nano
303 - *pal-amdhsa* - Radeon R9 Fury
304 - *pal-amdpal* - Radeon R9 FuryX
307 - Radeon Instinct MI8
308 \ - ``polaris10`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon RX 470
309 flat - *pal-amdhsa* - Radeon RX 480
310 scratch - *pal-amdpal* - Radeon Instinct MI6
311 \ - ``polaris11`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon RX 460
313 scratch - *pal-amdpal*
314 ``gfx805`` - ``tongapro`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - FirePro S7150
315 flat - *pal-amdhsa* - FirePro S7100
316 scratch - *pal-amdpal* - FirePro W7100
319 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack - Offset - *rocm-amdhsa* *TBA*
321 scratch - *pal-amdpal* .. TODO::
326 **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_ [AMD-GCN-GFX90A-CDNA2]_
327 -----------------------------------------------------------------------------------------------------------------------
328 ``gfx900`` ``amdgcn`` dGPU - xnack - Absolute - *rocm-amdhsa* - Radeon Vega
329 flat - *pal-amdhsa* Frontier Edition
330 scratch - *pal-amdpal* - Radeon RX Vega 56
334 - Radeon Instinct MI25
335 ``gfx902`` ``amdgcn`` APU - xnack - Absolute - *rocm-amdhsa* - Ryzen 3 2200G
336 flat - *pal-amdhsa* - Ryzen 5 2400G
337 scratch - *pal-amdpal*
338 ``gfx904`` ``amdgcn`` dGPU - xnack - *rocm-amdhsa* *TBA*
340 - *pal-amdpal* .. TODO::
345 ``gfx906`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* - Radeon Instinct MI50
346 - xnack flat - *pal-amdhsa* - Radeon Instinct MI60
347 scratch - *pal-amdpal* - Radeon VII
349 ``gfx908`` ``amdgcn`` dGPU - sramecc - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
353 ``gfx909`` ``amdgcn`` APU - xnack - Absolute - *pal-amdpal* *TBA*
360 ``gfx90a`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* *TBA*
362 - xnack scratch .. TODO::
364 work-item Add product
367 ``gfx90c`` ``amdgcn`` APU - xnack - Absolute - *pal-amdpal* - Ryzen 7 4700G
368 flat - Ryzen 7 4700GE
369 scratch - Ryzen 5 4600G
381 ``gfx940`` ``amdgcn`` dGPU - sramecc - Architected *TBA*
383 - xnack scratch .. TODO::
385 work-item Add product
388 **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
389 -----------------------------------------------------------------------------------------------------------------------
390 ``gfx1010`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 5700
391 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 5700 XT
392 - xnack scratch - *pal-amdpal* - Radeon Pro 5600 XT
394 ``gfx1011`` ``amdgcn`` dGPU - cumode - *rocm-amdhsa* - Radeon Pro V520
395 - wavefrontsize64 - Absolute - *pal-amdhsa*
396 - xnack flat - *pal-amdpal*
398 ``gfx1012`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 5500
399 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 5500 XT
400 - xnack scratch - *pal-amdpal*
401 ``gfx1013`` ``amdgcn`` APU - cumode - Absolute - *rocm-amdhsa* *TBA*
402 - wavefrontsize64 flat - *pal-amdhsa*
403 - xnack scratch - *pal-amdpal* .. TODO::
408 **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
409 -----------------------------------------------------------------------------------------------------------------------
410 ``gfx1030`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 6800
411 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 6800 XT
412 scratch - *pal-amdpal* - Radeon RX 6900 XT
413 ``gfx1031`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 6700 XT
414 - wavefrontsize64 flat - *pal-amdhsa*
415 scratch - *pal-amdpal*
416 ``gfx1032`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* *TBA*
417 - wavefrontsize64 flat - *pal-amdhsa*
418 scratch - *pal-amdpal* .. TODO::
423 ``gfx1033`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
424 - wavefrontsize64 flat
429 ``gfx1034`` ``amdgcn`` dGPU - cumode - Absolute - *pal-amdpal* *TBA*
430 - wavefrontsize64 flat
436 ``gfx1035`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
437 - wavefrontsize64 flat
442 ``gfx1036`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
443 - wavefrontsize64 flat
449 **GCN GFX11 (RDNA 3)** [AMD-GCN-GFX11-RDNA3]_
450 -----------------------------------------------------------------------------------------------------------------------
451 ``gfx1100`` ``amdgcn`` dGPU - cumode - Architected - *pal-amdpal* *TBA*
452 - wavefrontsize64 flat
455 work-item Add product
458 ``gfx1101`` ``amdgcn`` dGPU - cumode - Architected *TBA*
459 - wavefrontsize64 flat
462 work-item Add product
465 ``gfx1102`` ``amdgcn`` dGPU - cumode - Architected *TBA*
466 - wavefrontsize64 flat
469 work-item Add product
472 ``gfx1103`` ``amdgcn`` APU - cumode - Architected *TBA*
473 - wavefrontsize64 flat
476 work-item Add product
479 =========== =============== ============ ===== ================= =============== =============== ======================

.. _amdgpu-target-features:

Target features control how code is generated to support certain
processor-specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution, or a reduction in performance.

The target features supported by each processor are listed in
:ref:`amdgpu-processor-table`.

Target features are controlled by exactly one of the following Clang
options:

``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``

  The ``-mcpu`` and ``--offload-arch`` options can specify target features as
  optional components of the target ID. If omitted, the target feature has the
  ``any`` value. See :ref:`amdgpu-target-id`.

``-m[no-]<target-feature>``

  Target features not specified by the target ID are specified using a
  separate option. These target features can have an ``on`` or ``off``
  value. ``on`` is specified by omitting the ``no-`` prefix, and
  ``off`` is specified by including the ``no-`` prefix. The default
  if not specified is ``off``.

``-mcpu=gfx908:xnack+``
  Enable the ``xnack`` feature.
``-mcpu=gfx908:xnack-``
  Disable the ``xnack`` feature.
``-mcumode``
  Enable the ``cumode`` feature.
``-mno-cumode``
  Disable the ``cumode`` feature.

524 .. table:: AMDGPU Target Features
525 :name: amdgpu-target-features-table
527 =============== ============================ ==================================================
528 Target Feature Clang Option to Control Description
530 =============== ============================ ==================================================
531 cumode - ``-m[no-]cumode`` Control the wavefront execution mode used
532 when generating code for kernels. When disabled
533 native WGP wavefront execution mode is used,
534 when enabled CU wavefront execution mode is used
535 (see :ref:`amdgpu-amdhsa-memory-model`).
537 sramecc - ``-mcpu`` If specified, generate code that can only be
538 - ``--offload-arch`` loaded and executed in a process that has a
539 matching setting for SRAMECC.
541 If not specified for code object V2 to V3, generate
542 code that can be loaded and executed in a process
543 with SRAMECC enabled.
545 If not specified for code object V4 or above, generate
546 code that can be loaded and executed in a process
547 with either setting of SRAMECC.
549 tgsplit ``-m[no-]tgsplit`` Enable/disable generating code that assumes
550 work-groups are launched in threadgroup split mode.
551 When enabled the waves of a work-group may be
552 launched in different CUs.
554 wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
555 generating code for kernels. When disabled
556 native wavefront size 32 is used, when enabled
557 wavefront size 64 is used.
559 xnack - ``-mcpu`` If specified, generate code that can only be
560 - ``--offload-arch`` loaded and executed in a process that has a
561 matching setting for XNACK replay.
563 If not specified for code object V2 to V3, generate
564 code that can be loaded and executed in a process
565 with XNACK replay enabled.
567 If not specified for code object V4 or above, generate
568 code that can be loaded and executed in a process
569 with either setting of XNACK replay.
571 XNACK replay can be used for demand paging and
572 page migration. If enabled in the device, then if
573 a page fault occurs the code may execute
574 incorrectly unless generated with XNACK replay
575 enabled, or generated for code object V4 or above without
576 specifying XNACK replay. Executing code that was
577 generated with XNACK replay enabled, or generated
578 for code object V4 or above without specifying XNACK replay,
579 on a device that does not have XNACK replay
580 enabled will execute correctly but may be less
581 performant than code generated for XNACK replay
583 =============== ============================ ==================================================

.. _amdgpu-target-id:

AMDGPU supports target IDs. See `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
description. The AMDGPU target specific information is:

**processor**
  Is an AMDGPU processor or alternative processor name specified in
  :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
  the primary processor and alternative processor names. The canonical form
  target ID only allows the primary processor name.

**target-feature**
  Is a target feature name specified in :ref:`amdgpu-target-features-table` that
  is supported by the processor. The target features supported by each processor
  are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
  a target ID are marked as being controlled by ``-mcpu`` and
  ``--offload-arch``. Each target feature must appear at most once in a target
  ID. The non-canonical form target ID allows the target features to be
  specified in any order. The canonical form target ID requires the target
  features to be specified in alphabetical order.
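
The canonical-form rules above can be sketched in Python. This is an
illustration under stated assumptions, not the implementation used by Clang:
the function name ``canonicalize_target_id`` is hypothetical, the
``:<feature>+`` / ``:<feature>-`` spelling follows the ``-mcpu=gfx908:xnack+``
style, the alias map holds only a few entries taken from
:ref:`amdgpu-processor-table`, and no check is made that the processor
actually supports each feature.

```python
# A few alternative -> primary processor names from the processor table.
ALTERNATIVE_NAMES = {"tahiti": "gfx600", "kaveri": "gfx700", "fiji": "gfx803"}


def canonicalize_target_id(target_id):
    """Return the canonical form of a colon-separated target ID."""
    processor, *features = target_id.split(":")
    # The canonical form only allows the primary processor name.
    processor = ALTERNATIVE_NAMES.get(processor, processor)
    settings = {}
    for feature in features:
        if len(feature) < 2 or feature[-1] not in "+-":
            raise ValueError(f"malformed target feature: {feature!r}")
        name, setting = feature[:-1], feature[-1]
        # Each target feature must appear at most once.
        if name in settings:
            raise ValueError(f"duplicate target feature: {name}")
        settings[name] = setting
    # The canonical form lists features in alphabetical order.
    ordered = [name + settings[name] for name in sorted(settings)]
    return ":".join([processor] + ordered)
```

With this sketch, ``gfx908:xnack+:sramecc-`` is rewritten so that ``sramecc``
precedes ``xnack``, and ``fiji`` is replaced by its primary name ``gfx803``.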

.. _amdgpu-target-id-v2-v3:

Code Object V2 to V3 Target ID
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The target ID syntax for code object V2 to V3 is the same as defined in `Clang
Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
directive and the bundle entry ID. In those cases it has the following BNF
syntax:

.. code::

  <target-id> ::== <processor> ( "+" <target-feature> )*

Where a target feature is omitted if *Off* and present if *On* or *Any*.

Code object V2 to V3 cannot represent *Any* and treats it the same as *On*.
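
As a rough illustration of the ``+``-separated form, the Python sketch below
renders a processor and its feature settings. The function name
``to_v2_v3_target_id`` and the ``"on"``/``"off"``/``"any"`` strings are
hypothetical, and emitting the features in sorted order is an assumption made
only to keep the output deterministic.

```python
def to_v2_v3_target_id(processor, features):
    """Render <processor> ( "+" <target-feature> )* for code object V2 to V3.

    `features` maps a feature name to "on", "off" or "any".
    """
    present = []
    for name in sorted(features):  # ordering is an assumption of this sketch
        setting = features[name]
        if setting not in ("on", "off", "any"):
            raise ValueError(f"bad setting for {name}: {setting!r}")
        # A feature is omitted if Off and present if On or Any.
        if setting != "off":
            present.append(name)
    return "+".join([processor] + present)
```

Because *Any* and *On* produce the same string, the original distinction
cannot be recovered from a V2 to V3 target ID.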

.. _amdgpu-embedding-bundled-objects:

Embedding Bundled Code Objects
------------------------------

AMDGPU supports the HIP and OpenMP languages that perform code object embedding
as described in `Clang Offload Bundler
<https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.

The target ID syntax used for code object V2 to V3 for a bundle entry ID
differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.

.. _amdgpu-address-spaces:

The AMDGPU architecture supports a number of memory address spaces. The address
space names use the OpenCL standard names, with some additions.

The AMDGPU address spaces correspond to target architecture specific LLVM
address space numbers used in LLVM IR.

The AMDGPU address spaces are described in
:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
supported for the ``amdgcn`` target.

.. table:: AMDGPU Address Spaces
   :name: amdgpu-address-spaces-table

   ================================= =============== =========== ================ ======= ============================
   ..                                                                             64-Bit Process Address Space
   --------------------------------- --------------- ----------- ---------------- ------------------------------------
   Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
                                     Space Number    Name        Name             Size
   ================================= =============== =========== ================ ======= ============================
   Generic                           0               flat        flat             64      0x0000000000000000
   Global                            1               global      global           64      0x0000000000000000
   Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
   Local                             3               group       LDS              32      0xFFFFFFFF
   Constant                          4               constant    *same as global* 64      0x0000000000000000
   Private                           5               private     scratch          32      0xFFFFFFFF
   Constant 32-bit                   6               *TODO*                               0x00000000
   Buffer Fat Pointer (experimental) 7               *TODO*
   ================================= =============== =========== ================ ======= ============================

**Generic**

The generic address space is supported unless the *Target Properties* column
of :ref:`amdgpu-processor-table` specifies *Does not support generic address
space*.

The generic address space uses the hardware flat address support for two fixed
ranges of virtual addresses (the private and local apertures), which lie
outside the range of addressable global memory, to map from a flat address to
a private or local address. This uses FLAT instructions that can take a flat
address and access global, private (scratch), and group (LDS) memory depending
on whether the address is within one of the aperture ranges.

Flat access to scratch requires hardware aperture setup and setup in the
kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).

To convert between a private or group address space address (termed a segment
address) and a flat address, the base address of the corresponding aperture
can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
GFX9-GFX11 the aperture base addresses are directly available as inline
constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
aligned to 2^32, which makes it easier to convert from flat to segment or
segment to flat.

A global address space address has the same value when used as a flat address
so no conversion is needed.
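
Because the apertures are 2^32 bytes in size and aligned to 2^32, the
conversion reduces to combining or masking 32-bit halves. The Python sketch
below illustrates this under stated assumptions: the function names and the
example aperture base are hypothetical, and mapping the segment NULL value to
the flat NULL value (and back) is an assumption based on the NULL values in
:ref:`amdgpu-address-spaces-table`, not a description of the code the backend
emits.

```python
APERTURE_SIZE = 1 << 32      # aperture size and alignment in 64-bit mode
SEGMENT_NULL = 0xFFFFFFFF    # NULL value of the local and private spaces
FLAT_NULL = 0                # NULL value of the generic (flat) space


def segment_to_flat(segment_address, aperture_base):
    """Convert a 32-bit local/private address to a flat address."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    if segment_address == SEGMENT_NULL:
        return FLAT_NULL  # assumption: NULL maps to NULL
    # The low 32 bits of an aligned base are zero, so OR acts as addition.
    return aperture_base | segment_address


def flat_to_segment(flat_address):
    """Convert a flat address inside an aperture to a segment address."""
    if flat_address == FLAT_NULL:
        return SEGMENT_NULL  # assumption: NULL maps to NULL
    return flat_address & (APERTURE_SIZE - 1)


def in_aperture(flat_address, aperture_base):
    """Check whether a flat address falls inside the given aperture."""
    return (flat_address >> 32) == (aperture_base >> 32)
```

For a made-up aperture base of ``0x0001000000000000``, segment address
``0x10`` maps to flat address ``0x0001000000000010`` and back again.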

**Global and Constant**

The global and constant address spaces both use global virtual addresses,
which are the same virtual address space used by the CPU. However, some
virtual addresses may only be accessible to the CPU, some only accessible
by the GPU, and some by both.

Using the constant address space indicates that the data will not change
during the execution of the kernel. This allows scalar read instructions to
be used. Because the constant address space can only be modified on the host
side, a generic pointer loaded from the constant address space can safely be
assumed to be a global pointer, since only the device global memory is visible
and managed on the host side. The vector and scalar L1 caches are invalidated
of volatile data before each kernel dispatch execution to allow constant
memory to change values between kernel dispatches.

**Region**

The region address space uses the hardware Global Data Store (GDS). All
wavefronts executing on the same device will access the same memory for any
given region address. However, the same region address accessed by wavefronts
executing on different devices will access different memory. It is higher
performance than global memory. It is allocated by the runtime. The data
store (DS) instructions can be used to access it.

**Local**

The local address space uses the hardware Local Data Store (LDS), which is
automatically allocated when the hardware creates the wavefronts of a
work-group, and freed when all the wavefronts of a work-group have
terminated. All wavefronts belonging to the same work-group will access the
same memory for any given local address. However, the same local address
accessed by wavefronts belonging to different work-groups will access
different memory. It is higher performance than global memory. The data store
(DS) instructions can be used to access it.

**Private**

The private address space uses the hardware scratch memory support, which
automatically allocates memory when it creates a wavefront and frees it when
a wavefront terminates. The memory accessed by a lane of a wavefront for any
given private address will be different from the memory accessed by another lane
of the same or different wavefront for the same private address.

If a kernel dispatch uses scratch, then the hardware allocates memory from a
pool of backing memory allocated by the runtime for each wavefront. The lanes
of the wavefront access this using dword (4 byte) interleaving. The mapping
used from private address to backing memory address is:

  ``wavefront-scratch-base +
  ((private-address / 4) * wavefront-size * 4) +
  (wavefront-lane-id * 4) + (private-address % 4)``

If each lane of a wavefront accesses the same private address, the
interleaving results in adjacent dwords being accessed and hence requires
fewer cache lines to be fetched.
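
The mapping can be checked with a few lines of Python. This is only a worked
illustration of the formula above: the function name
``scratch_backing_address`` and the example base address are hypothetical, and
the division is taken to be integer division.

```python
def scratch_backing_address(wavefront_scratch_base, private_address,
                            wavefront_lane_id, wavefront_size=64):
    """Map a private address to its backing memory address.

    Implements dword (4 byte) interleaving across the lanes of a wavefront.
    """
    # Split the private address into a dword index and a byte within it.
    dword_index, byte_in_dword = divmod(private_address, 4)
    return (wavefront_scratch_base
            + dword_index * wavefront_size * 4  # one dword row per wavefront
            + wavefront_lane_id * 4             # this lane's slot in the row
            + byte_in_dword)


# Lanes 0..3 of a 64-wide wavefront all accessing private address 8.
addresses = [scratch_backing_address(0x1000, 8, lane) for lane in range(4)]
```

Here ``addresses`` is ``[0x1200, 0x1204, 0x1208, 0x120C]``: the lanes touch
adjacent dwords, which is the cache-friendly behaviour described above.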

There are different ways that the wavefront scratch base address is
determined by a wavefront (see
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

Scratch memory can be accessed in an interleaved manner using buffer
instructions with the scratch buffer descriptor and per wavefront scratch
offset, by the scratch instructions, or by flat instructions. Multi-dword
access is not supported except by flat and scratch instructions in
777 **Buffer Fat Pointer**
778 The buffer fat pointer is an experimental address space that is currently
779 unsupported in the backend. It exposes a non-integral pointer that is in
780 the future intended to support the modelling of 128-bit buffer descriptors
781 plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
782 *pointer*), allowing normal LLVM load/store/atomic operations to be used to
783 model the buffer descriptors used heavily in graphics workloads targeting

.. _amdgpu-memory-scopes:

This section describes the LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).

The memory model supported is based on the HSA memory model [HSA]_ which is
based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
relation is transitive over the synchronizes-with relation independent of scope
and synchronizes-with allows the memory scope instances to be inclusive (see
table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).

This differs from the OpenCL [OpenCL]_ memory model, which does not have scope
inclusion and requires the memory scopes to exactly match. However, this
is conservatively correct for OpenCL.

805 .. table:: AMDHSA LLVM Sync Scopes
806 :name: amdgpu-amdhsa-llvm-sync-scopes-table
808 ======================= ===================================================
809 LLVM Sync Scope Description
810 ======================= ===================================================
811 *none* The default: ``system``.
813 Synchronizes with, and participates in modification
814 and seq_cst total orderings with, other operations
815 (except image operations) for all address spaces
816 (except private, or generic that accesses private)
817 provided the other operation's sync scope is:
820 - ``agent`` and executed by a thread on the same
822 - ``workgroup`` and executed by a thread in the
824 - ``wavefront`` and executed by a thread in the
827 ``agent`` Synchronizes with, and participates in modification
828 and seq_cst total orderings with, other operations
829 (except image operations) for all address spaces
830 (except private, or generic that accesses private)
831 provided the other operation's sync scope is:
833 - ``system`` or ``agent`` and executed by a thread
835 - ``workgroup`` and executed by a thread in the
837 - ``wavefront`` and executed by a thread in the
840 ``workgroup`` Synchronizes with, and participates in modification
841 and seq_cst total orderings with, other operations
842 (except image operations) for all address spaces
843 (except private, or generic that accesses private)
844 provided the other operation's sync scope is:
846 - ``system``, ``agent`` or ``workgroup`` and
847 executed by a thread in the same work-group.
848 - ``wavefront`` and executed by a thread in the
851 ``wavefront`` Synchronizes with, and participates in modification
852 and seq_cst total orderings with, other operations
853 (except image operations) for all address spaces
854 (except private, or generic that accesses private)
855 provided the other operation's sync scope is:
857 - ``system``, ``agent``, ``workgroup`` or
858 ``wavefront`` and executed by a thread in the
861 ``singlethread`` Only synchronizes with and participates in
862 modification and seq_cst total orderings with,
863 other operations (except image operations) running
864 in the same thread for all address spaces (for
865 example, in signal handlers).
867 ``one-as`` Same as ``system`` but only synchronizes with other
868 operations within the same address space.
870 ``agent-one-as`` Same as ``agent`` but only synchronizes with other
871 operations within the same address space.
873 ``workgroup-one-as`` Same as ``workgroup`` but only synchronizes with
874 other operations within the same address space.
876 ``wavefront-one-as`` Same as ``wavefront`` but only synchronizes with
877 other operations within the same address space.
879 ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
880 other operations within the same address space.
881 ======================= ===================================================
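
The scope inclusion rule in the table can be summarized as follows: two
operations can synchronize when each operation's scope instance contains the
thread executing the other operation. The Python sketch below encodes this
reading. It is an illustrative model only; the names ``Level`` and
``scopes_include`` are hypothetical, and the address space conditions, the
image operation exclusion and the ``*-one-as`` variants are deliberately not
modelled.

```python
from enum import IntEnum


class Level(IntEnum):
    """Nesting levels of the execution hierarchy, innermost first."""
    THREAD = 0
    WAVEFRONT = 1
    WORKGROUP = 2
    AGENT = 3
    SYSTEM = 4


# LLVM sync scope name -> hierarchy level; "" stands for *none* (system).
SCOPE_LEVEL = {
    "singlethread": Level.THREAD,
    "wavefront": Level.WAVEFRONT,
    "workgroup": Level.WORKGROUP,
    "agent": Level.AGENT,
    "system": Level.SYSTEM,
    "": Level.SYSTEM,
}


def scopes_include(scope_a, scope_b, closest_common):
    """Return True if the two operations' scopes include each other.

    `closest_common` is the innermost Level shared by the two threads, for
    example Level.WORKGROUP for threads in the same work-group but in
    different wavefronts.
    """
    narrowest = min(SCOPE_LEVEL[scope_a], SCOPE_LEVEL[scope_b])
    return narrowest >= closest_common
```

For instance, an ``agent`` operation and a ``workgroup`` operation satisfy the
rule when their threads share a work-group
(``scopes_include("agent", "workgroup", Level.WORKGROUP)`` is true), but not
when the threads only share an agent.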

The AMDGPU backend implements the following LLVM IR intrinsics.

*This section is WIP.*

.. TODO::

   List AMDGPU intrinsics.

The AMDGPU backend supports the following LLVM IR attributes.

899 .. table:: AMDGPU LLVM IR Attributes
900 :name: amdgpu-llvm-ir-attributes-table
902 ======================================= ==========================================================
903 LLVM Attribute Description
904 ======================================= ==========================================================
905 "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
906 will be specified when the kernel is dispatched. Generated
907 by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
908 The implied default value is 1,1024.
910 "amdgpu-implicitarg-num-bytes"="n" Number of kernel argument bytes to add to the kernel
911 argument block size for the implicit arguments. This
912 varies by OS and language (for OpenCL see
913 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
914 "amdgpu-num-sgpr"="n" Specifies the number of SGPRs to use. Generated by
915 the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
916 "amdgpu-num-vgpr"="n" Specifies the number of VGPRs to use. Generated by the
917 ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
918 "amdgpu-waves-per-eu"="m,n" Specify the minimum and maximum number of waves per
919 execution unit. Generated by the ``amdgpu_waves_per_eu``
920 CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
921 and the backend may not be able to satisfy the request. If
922 the specified range is incompatible with the function's
"amdgpu-flat-work-group-size" value, the occupancy bounds
implied by the workgroup size take precedence.
926 "amdgpu-ieee" true/false. Specify whether the function expects the IEEE field of the
927 mode register to be set on entry. Overrides the default for
928 the calling convention.
929 "amdgpu-dx10-clamp" true/false. Specify whether the function expects the DX10_CLAMP field of
930 the mode register to be set on entry. Overrides the default
931 for the calling convention.
933 "amdgpu-no-workitem-id-x" Indicates the function does not depend on the value of the
934 llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
935 attribute, or reached through a call site marked with this attribute,
936 the value returned by the intrinsic is undefined. The backend can
937 generally infer this during code generation, so typically there is no
938 benefit to frontends marking functions with this.
940 "amdgpu-no-workitem-id-y" The same as amdgpu-no-workitem-id-x, except for the
941 llvm.amdgcn.workitem.id.y intrinsic.
943 "amdgpu-no-workitem-id-z" The same as amdgpu-no-workitem-id-x, except for the
944 llvm.amdgcn.workitem.id.z intrinsic.
946 "amdgpu-no-workgroup-id-x" The same as amdgpu-no-workitem-id-x, except for the
947 llvm.amdgcn.workgroup.id.x intrinsic.
949 "amdgpu-no-workgroup-id-y" The same as amdgpu-no-workitem-id-x, except for the
950 llvm.amdgcn.workgroup.id.y intrinsic.
952 "amdgpu-no-workgroup-id-z" The same as amdgpu-no-workitem-id-x, except for the
953 llvm.amdgcn.workgroup.id.z intrinsic.
955 "amdgpu-no-dispatch-ptr" The same as amdgpu-no-workitem-id-x, except for the
956 llvm.amdgcn.dispatch.ptr intrinsic.
958 "amdgpu-no-implicitarg-ptr" The same as amdgpu-no-workitem-id-x, except for the
959 llvm.amdgcn.implicitarg.ptr intrinsic.
961 "amdgpu-no-dispatch-id" The same as amdgpu-no-workitem-id-x, except for the
962 llvm.amdgcn.dispatch.id intrinsic.
964 "amdgpu-no-queue-ptr" Similar to amdgpu-no-workitem-id-x, except for the
965 llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
966 attributes, the queue pointer may be required in situations where the
967 intrinsic call does not directly appear in the program. Some subtargets
require the queue pointer to handle some addrspacecasts, as well
969 as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
970 llvm.debug intrinsics.
972 "amdgpu-no-hostcall-ptr" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
973 kernel argument that holds the pointer to the hostcall buffer. If this
974 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
976 "amdgpu-no-heap-ptr" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
977 kernel argument that holds the pointer to an initialized memory buffer
that conforms to the requirements of the V1 implementation of the malloc/free
device library. If this attribute is absent, then the
980 amdgpu-no-implicitarg-ptr is also removed.
982 "amdgpu-no-multigrid-sync-arg" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
983 kernel argument that holds the multigrid synchronization pointer. If this
984 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
986 "amdgpu-no-default-queue" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
987 kernel argument that holds the default queue pointer. If this
988 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
990 "amdgpu-no-completion-action" Similar to amdgpu-no-implicitarg-ptr, except specific to the implicit
991 kernel argument that holds the completion action pointer. If this
992 attribute is absent, then the amdgpu-no-implicitarg-ptr is also removed.
994 ======================================= ==========================================================
996 .. _amdgpu-elf-code-object:
1001 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
1002 can be linked by ``lld`` to produce a standard ELF shared code object which can
1003 be loaded and executed on an AMDGPU target.
1005 .. _amdgpu-elf-header:
1010 The AMDGPU backend uses the following ELF header:
1012 .. table:: AMDGPU ELF Header
1013 :name: amdgpu-elf-header-table
1015 ========================== ===============================
1017 ========================== ===============================
1018 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
1019 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
1020 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
1021 - ``ELFOSABI_AMDGPU_HSA``
1022 - ``ELFOSABI_AMDGPU_PAL``
1023 - ``ELFOSABI_AMDGPU_MESA3D``
1024 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
1025 - ``ELFABIVERSION_AMDGPU_HSA_V3``
1026 - ``ELFABIVERSION_AMDGPU_HSA_V4``
1027 - ``ELFABIVERSION_AMDGPU_HSA_V5``
1028 - ``ELFABIVERSION_AMDGPU_PAL``
1029 - ``ELFABIVERSION_AMDGPU_MESA3D``
1030 ``e_type`` - ``ET_REL``
1032 ``e_machine`` ``EM_AMDGPU``
1034 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-v2-table`,
1035 :ref:`amdgpu-elf-header-e_flags-table-v3`,
1036 and :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`
1037 ========================== ===============================
1041 .. table:: AMDGPU ELF Header Enumeration Values
1042 :name: amdgpu-elf-header-enumeration-values-table
1044 =============================== =====
1046 =============================== =====
1049 ``ELFOSABI_AMDGPU_HSA`` 64
1050 ``ELFOSABI_AMDGPU_PAL`` 65
1051 ``ELFOSABI_AMDGPU_MESA3D`` 66
1052 ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
1053 ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
1054 ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
1055 ``ELFABIVERSION_AMDGPU_HSA_V5`` 3
1056 ``ELFABIVERSION_AMDGPU_PAL`` 0
1057 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
1058 =============================== =====
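
As an illustration of how the identification fields fit together, the
following Python sketch decodes the first bytes of a code object header. It is
not part of LLVM: the function name ``describe_amdgpu_elf`` is hypothetical,
and the values of ``ELFOSABI_NONE`` (0), ``ELFCLASS32`` (1), ``ELFCLASS64``
(2), ``ELFDATA2LSB`` (1), and ``EM_AMDGPU`` (224) are assumed from the generic
ELF definitions rather than taken from the table above.

.. code-block:: python

  import struct

  EM_AMDGPU = 224  # assumed from the generic ELF machine registry
  ELF_CLASSES = {1: "ELFCLASS32", 2: "ELFCLASS64"}
  OS_ABIS = {
      0: "ELFOSABI_NONE",
      64: "ELFOSABI_AMDGPU_HSA",
      65: "ELFOSABI_AMDGPU_PAL",
      66: "ELFOSABI_AMDGPU_MESA3D",
  }
  HSA_ABI_VERSIONS = {0: "V2", 1: "V3", 2: "V4", 3: "V5"}


  def describe_amdgpu_elf(header: bytes) -> dict:
      """Decode the e_ident bytes and e_machine of an AMDGPU ELF header."""
      if len(header) < 20 or header[:4] != b"\x7fELF":
          raise ValueError("not an ELF header")
      ei_class, ei_data, _ei_version, ei_osabi, ei_abiversion = header[4:9]
      if ei_data != 1:  # ELFDATA2LSB
          raise ValueError("AMDGPU code objects are little-endian")
      # e_type and e_machine follow the 16-byte e_ident array.
      _e_type, e_machine = struct.unpack_from("<HH", header, 16)
      if e_machine != EM_AMDGPU:
          raise ValueError("e_machine is not EM_AMDGPU")
      # The ABI version selects a code object version only for the HSA OS ABI.
      version = HSA_ABI_VERSIONS.get(ei_abiversion) if ei_osabi == 64 else None
      return {
          "class": ELF_CLASSES.get(ei_class, "unknown"),
          "os_abi": OS_ABIS.get(ei_osabi, "unknown"),
          "code_object_version": version,
      }

Passing the first 20 bytes of a code object file is enough for this sketch,
because it reads only ``e_ident``, ``e_type``, and ``e_machine``.
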
1060 ``e_ident[EI_CLASS]``
1063 * ``ELFCLASS32`` for ``r600`` architecture.
1065 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
1066 process address space applications.
1068 ``e_ident[EI_DATA]``
1069 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
1071 ``e_ident[EI_OSABI]``
1072 One of the following AMDGPU target architecture specific OS ABIs
1073 (see :ref:`amdgpu-os`):
1075 * ``ELFOSABI_NONE`` for *unknown* OS.
1077 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
1079 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
* ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
1083 ``e_ident[EI_ABIVERSION]``
1084 The ABI version of the AMDGPU target architecture specific OS ABI to which the code
1087 * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
1088 runtime ABI for code object V2. Specify using the Clang option
1089 ``-mcode-object-version=2``.
1091 * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
1092 runtime ABI for code object V3. Specify using the Clang option
1093 ``-mcode-object-version=3``.
1095 * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
1096 runtime ABI for code object V4. Specify using the Clang option
1097 ``-mcode-object-version=4``. This is the default code object
1098 version if not specified.
1100 * ``ELFABIVERSION_AMDGPU_HSA_V5`` is used to specify the version of AMD HSA
1101 runtime ABI for code object V5. Specify using the Clang option
1102 ``-mcode-object-version=5``.
1104 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
1107 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
1111 Can be one of the following values:
1115 The type produced by the AMDGPU backend compiler as it is relocatable code
1119 The type produced by the linker as it is a shared code object.
The AMD HSA runtime loader requires an ``ET_DYN`` code object.
1124 The value ``EM_AMDGPU`` is used for the machine for all processors supported
1125 by the ``r600`` and ``amdgcn`` architectures (see
1126 :ref:`amdgpu-processor-table`). The specific processor is specified in the
1127 ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
1128 :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
1129 ``e_flags`` for code object V3 and above (see
1130 :ref:`amdgpu-elf-header-e_flags-table-v3` and
1131 :ref:`amdgpu-elf-header-e_flags-table-v4-onwards`).
1134 The entry point is 0 as the entry points for individual kernels must be
1135 selected in order to invoke them through AQL packets.
1138 The AMDGPU backend uses the following ELF header flags:
1140 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
1141 :name: amdgpu-elf-header-e_flags-v2-table
1143 ===================================== ===== =============================
1144 Name Value Description
1145 ===================================== ===== =============================
1146 ``EF_AMDGPU_FEATURE_XNACK_V2`` 0x01 Indicates if the ``xnack``
1148 enabled for all code
1149 contained in the code object.
1151 does not support the
1156 :ref:`amdgpu-target-features`.
1157 ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02 Indicates if the trap
1158 handler is enabled for all
1159 code contained in the code
1160 object. If the processor
1161 does not support a trap
1162 handler then must be 0.
1164 :ref:`amdgpu-target-features`.
1165 ===================================== ===== =============================
1167 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
1168 :name: amdgpu-elf-header-e_flags-table-v3
1170 ================================= ===== =============================
1171 Name Value Description
1172 ================================= ===== =============================
1173 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection
1175 ``EF_AMDGPU_MACH_xxx`` values
1177 :ref:`amdgpu-ef-amdgpu-mach-table`.
1178 ``EF_AMDGPU_FEATURE_XNACK_V3`` 0x100 Indicates if the ``xnack``
1180 enabled for all code
1181 contained in the code object.
1183 does not support the
1188 :ref:`amdgpu-target-features`.
1189 ``EF_AMDGPU_FEATURE_SRAMECC_V3`` 0x200 Indicates if the ``sramecc``
1191 enabled for all code
1192 contained in the code object.
1194 does not support the
1199 :ref:`amdgpu-target-features`.
1200 ================================= ===== =============================
1202 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4 and After
1203 :name: amdgpu-elf-header-e_flags-table-v4-onwards
1205 ============================================ ===== ===================================
1206 Name Value Description
1207 ============================================ ===== ===================================
1208 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection
1210 ``EF_AMDGPU_MACH_xxx`` values
1212 :ref:`amdgpu-ef-amdgpu-mach-table`.
1213 ``EF_AMDGPU_FEATURE_XNACK_V4`` 0x300 XNACK selection mask for
1214 ``EF_AMDGPU_FEATURE_XNACK_*_V4``
``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4`` 0x000 XNACK unsupported.
1217 ``EF_AMDGPU_FEATURE_XNACK_ANY_V4`` 0x100 XNACK can have any value.
1218 ``EF_AMDGPU_FEATURE_XNACK_OFF_V4`` 0x200 XNACK disabled.
1219 ``EF_AMDGPU_FEATURE_XNACK_ON_V4`` 0x300 XNACK enabled.
1220 ``EF_AMDGPU_FEATURE_SRAMECC_V4`` 0xc00 SRAMECC selection mask for
1221 ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
1224 ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4`` 0x400 SRAMECC can have any value.
``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4`` 0x800 SRAMECC disabled.
1226 ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4`` 0xc00 SRAMECC enabled.
1227 ============================================ ===== ===================================
1229 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
1230 :name: amdgpu-ef-amdgpu-mach-table
1232 ==================================== ========== =============================
1233 Name Value Description (see
1234 :ref:`amdgpu-processor-table`)
1235 ==================================== ========== =============================
1236 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
1237 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
1238 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
1239 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
1240 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
1241 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
1242 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
1243 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
1244 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
1245 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
1246 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
1247 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
1248 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
1249 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
1250 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
1251 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
1252 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
1253 *reserved* 0x011 - Reserved for ``r600``
1254 0x01f architecture processors.
1255 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
1256 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
1257 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
1258 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
1259 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
1260 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
1261 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
1262 *reserved* 0x027 Reserved.
1263 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
1264 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
1265 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
1266 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
1267 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
1268 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
1269 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
1270 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
1271 ``EF_AMDGPU_MACH_AMDGCN_GFX908`` 0x030 ``gfx908``
1272 ``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909``
1273 ``EF_AMDGPU_MACH_AMDGCN_GFX90C`` 0x032 ``gfx90c``
1274 ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033 ``gfx1010``
1275 ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034 ``gfx1011``
1276 ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035 ``gfx1012``
1277 ``EF_AMDGPU_MACH_AMDGCN_GFX1030`` 0x036 ``gfx1030``
1278 ``EF_AMDGPU_MACH_AMDGCN_GFX1031`` 0x037 ``gfx1031``
1279 ``EF_AMDGPU_MACH_AMDGCN_GFX1032`` 0x038 ``gfx1032``
1280 ``EF_AMDGPU_MACH_AMDGCN_GFX1033`` 0x039 ``gfx1033``
1281 ``EF_AMDGPU_MACH_AMDGCN_GFX602`` 0x03a ``gfx602``
1282 ``EF_AMDGPU_MACH_AMDGCN_GFX705`` 0x03b ``gfx705``
1283 ``EF_AMDGPU_MACH_AMDGCN_GFX805`` 0x03c ``gfx805``
1284 ``EF_AMDGPU_MACH_AMDGCN_GFX1035`` 0x03d ``gfx1035``
1285 ``EF_AMDGPU_MACH_AMDGCN_GFX1034`` 0x03e ``gfx1034``
1286 ``EF_AMDGPU_MACH_AMDGCN_GFX90A`` 0x03f ``gfx90a``
1287 ``EF_AMDGPU_MACH_AMDGCN_GFX940`` 0x040 ``gfx940``
1288 ``EF_AMDGPU_MACH_AMDGCN_GFX1100`` 0x041 ``gfx1100``
1289 ``EF_AMDGPU_MACH_AMDGCN_GFX1013`` 0x042 ``gfx1013``
1290 *reserved* 0x043 Reserved.
1291 ``EF_AMDGPU_MACH_AMDGCN_GFX1103`` 0x044 ``gfx1103``
1292 ``EF_AMDGPU_MACH_AMDGCN_GFX1036`` 0x045 ``gfx1036``
1293 ``EF_AMDGPU_MACH_AMDGCN_GFX1101`` 0x046 ``gfx1101``
1294 ``EF_AMDGPU_MACH_AMDGCN_GFX1102`` 0x047 ``gfx1102``
1295 ==================================== ========== =============================
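
The masks in the code object V4 ``e_flags`` table and the ``EF_AMDGPU_MACH``
values can be combined as in the following Python sketch. It is an
illustration only: ``decode_e_flags_v4`` is a hypothetical helper, and
``MACH_NAMES`` holds only a small excerpt of the table above.

.. code-block:: python

  EF_AMDGPU_MACH = 0x0FF
  EF_AMDGPU_FEATURE_XNACK_V4 = 0x300
  EF_AMDGPU_FEATURE_SRAMECC_V4 = 0xC00

  XNACK_V4 = {0x000: "unsupported", 0x100: "any", 0x200: "off", 0x300: "on"}
  SRAMECC_V4 = {0x000: "unsupported", 0x400: "any", 0x800: "off", 0xC00: "on"}

  # Excerpt of the EF_AMDGPU_MACH table; extend as needed.
  MACH_NAMES = {
      0x02C: "gfx900",
      0x02F: "gfx906",
      0x030: "gfx908",
      0x036: "gfx1030",
      0x03F: "gfx90a",
      0x041: "gfx1100",
  }


  def decode_e_flags_v4(e_flags: int) -> dict:
      """Split a code object V4 (or later) e_flags word into its bit fields."""
      mach = e_flags & EF_AMDGPU_MACH
      return {
          "processor": MACH_NAMES.get(mach, f"unknown (0x{mach:03x})"),
          "xnack": XNACK_V4[e_flags & EF_AMDGPU_FEATURE_XNACK_V4],
          "sramecc": SRAMECC_V4[e_flags & EF_AMDGPU_FEATURE_SRAMECC_V4],
      }

For example, an ``e_flags`` value of ``0xb3f`` selects ``gfx90a`` with XNACK
enabled and SRAMECC disabled.
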
1300 An AMDGPU target ELF code object has the standard ELF sections which include:

.. table:: AMDGPU ELF Sections
   :name: amdgpu-elf-sections-table

   ================== ================ =================================
   Name               Type             Attributes
   ================== ================ =================================
   ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
   ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
   ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
   ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
   ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
   ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
   ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
   ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
   ``.note``          ``SHT_NOTE``     *none*
   ``.rela``\ *name*  ``SHT_RELA``     *none*
   ``.rela.dyn``      ``SHT_RELA``     *none*
   ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
   ``.shstrtab``      ``SHT_STRTAB``   *none*
   ``.strtab``        ``SHT_STRTAB``   *none*
   ``.symtab``        ``SHT_SYMTAB``   *none*
   ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
   ================== ================ =================================

1326 These sections have their standard meanings (see [ELF]_) and are only generated
1330 The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
1331 information on the DWARF produced by the AMDGPU backend.
1333 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
1334 The standard sections used by a dynamic loader.
1337 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
1340 ``.rela``\ *name*, ``.rela.dyn``
For relocatable code objects, *name* is the name of the section to which the
relocation records apply. For example, ``.rela.text`` is the section name for
1343 relocation records associated with the ``.text`` section.
1345 For linked shared code objects, ``.rela.dyn`` contains all the relocation
1346 records from each of the relocatable code object's ``.rela``\ *name* sections.
1348 See :ref:`amdgpu-relocation-records` for the relocation records supported by
1352 The executable machine code for the kernels and functions they call. Generated
1353 as position independent code. See :ref:`amdgpu-code-conventions` for
information on conventions used in the ISA generation.
1356 .. _amdgpu-note-records:
1361 The AMDGPU backend code object contains ELF note records in the ``.note``
1362 section. The set of generated notes and their semantics depend on the code
1363 object version; see :ref:`amdgpu-note-records-v2` and
1364 :ref:`amdgpu-note-records-v3-onwards`.
1366 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
1367 must be generated after the ``name`` field to ensure the ``desc`` field is 4
1368 byte aligned. In addition, minimal zero-byte padding must be generated to
1369 ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
field of the ``.note`` section must be at least 4 to indicate at least 4 byte
alignment.
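
The padding rules above can be sketched in Python as a small note walker. This
is an illustration under the assumption of the standard ELF note layout (three
little-endian 32-bit words ``namesz``, ``descsz``, and ``type``, followed by
the padded ``name`` and ``desc`` fields); the helper names ``align4`` and
``iter_notes`` are hypothetical.

.. code-block:: python

  import struct


  def align4(size: int) -> int:
      """Round size up to the next multiple of 4."""
      return (size + 3) & ~3


  def iter_notes(section: bytes):
      """Yield (name, type, desc) for each note record in a .note section."""
      offset = 0
      while offset + 12 <= len(section):
          namesz, descsz, note_type = struct.unpack_from("<III", section, offset)
          offset += 12
          # namesz counts the terminating NUL but not the zero padding after it.
          name = section[offset:offset + namesz].rstrip(b"\0").decode("ascii")
          offset += align4(namesz)
          desc = section[offset:offset + descsz]
          # Skip the zero padding that keeps the next record 4 byte aligned.
          offset += align4(descsz)
          yield name, note_type, desc

A consumer would then dispatch on the ``(name, type)`` pair, for example
``("AMDGPU", 32)`` for ``NT_AMDGPU_METADATA``.
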
1373 .. _amdgpu-note-records-v2:
1375 Code Object V2 Note Records
1376 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
1379 Code object V2 is not the default code object version emitted by
1380 this version of LLVM.
1382 The AMDGPU backend code object uses the following ELF note record in the
1383 ``.note`` section when compiling for code object V2.
1385 The note record vendor field is "AMD".
1387 Additional note records may be present, but any which are not documented here
1388 are deprecated and should not be used.

.. table:: AMDGPU Code Object V2 ELF Note Records
   :name: amdgpu-elf-note-records-v2-table

   ===== ===================================== ======================================
   Name  Type                                  Description
   ===== ===================================== ======================================
   "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION``    Code object version.
   "AMD" ``NT_AMD_HSA_HSAIL``                  HSAIL properties generated by the HSAIL
                                               Finalizer and not the LLVM compiler.
   "AMD" ``NT_AMD_HSA_ISA_VERSION``            Target ISA version.
   "AMD" ``NT_AMD_HSA_METADATA``               Metadata null terminated string in
                                               YAML [YAML]_ textual format.
   "AMD" ``NT_AMD_HSA_ISA_NAME``               Target ISA name.
   ===== ===================================== ======================================

1407 .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
1408 :name: amdgpu-elf-note-record-enumeration-values-v2-table
1410 ===================================== =====
1412 ===================================== =====
1413 ``NT_AMD_HSA_CODE_OBJECT_VERSION`` 1
1414 ``NT_AMD_HSA_HSAIL`` 2
1415 ``NT_AMD_HSA_ISA_VERSION`` 3
1417 ``NT_AMD_HSA_METADATA`` 10
1418 ``NT_AMD_HSA_ISA_NAME`` 11
1419 ===================================== =====
1421 ``NT_AMD_HSA_CODE_OBJECT_VERSION``
1422 Specifies the code object version number. The description field has the
1427 struct amdgpu_hsa_note_code_object_version_s {
1428 uint32_t major_version;
1429 uint32_t minor_version;
1432 The ``major_version`` has a value less than or equal to 2.
1434 ``NT_AMD_HSA_HSAIL``
1435 Specifies the HSAIL properties used by the HSAIL Finalizer. The description
1436 field has the following layout:
1440 struct amdgpu_hsa_note_hsail_s {
1441 uint32_t hsail_major_version;
1442 uint32_t hsail_minor_version;
1444 uint8_t machine_model;
1445 uint8_t default_float_round;
1448 ``NT_AMD_HSA_ISA_VERSION``
1449 Specifies the target ISA version. The description field has the following layout:
1453 struct amdgpu_hsa_note_isa_s {
1454 uint16_t vendor_name_size;
1455 uint16_t architecture_name_size;
1459 char vendor_and_architecture_name[1];
``vendor_name_size`` and ``architecture_name_size`` are the lengths of the
1463 vendor and architecture names respectively, including the NUL character.
``vendor_and_architecture_name`` contains the NUL terminated string for the
1466 vendor, immediately followed by the NUL terminated string for the
1469 This note record is used by the HSA runtime loader.
1471 Code object V2 only supports a limited number of processors and has fixed
1472 settings for target features. See
1473 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
1474 processors and the corresponding target ID. In the table the note record ISA
1475 name is a concatenation of the vendor name, architecture name, major, minor,
1476 and stepping separated by a ":".
1478 The target ID column shows the processor name and fixed target features used
by the LLVM compiler. The LLVM compiler does not generate an
``NT_AMD_HSA_HSAIL`` note record.

A code object generated by the Finalizer also uses code object V2 and always
generates an ``NT_AMD_HSA_HSAIL`` note record. The processor name and
``sramecc`` target feature are as shown in
1485 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
1486 target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
1489 ``NT_AMD_HSA_ISA_NAME``
1490 Specifies the target ISA name as a non-NUL terminated string.
1492 This note record is not used by the HSA runtime loader.
See the ``NT_AMD_HSA_ISA_VERSION`` note record description for code object
V2's limited support of processors and fixed settings for target features.
1497 See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
1498 from the string to the corresponding target ID. If the ``xnack`` target
1499 feature is supported and enabled, the string produced by the LLVM compiler
may have a ``+xnack`` appended. The Finalizer did not do the appending and
1501 instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
1503 ``NT_AMD_HSA_METADATA``
1504 Specifies extensible metadata associated with the code objects executed on HSA
1505 [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
1506 target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
1507 :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
1510 .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
1511 :name: amdgpu-elf-note-record-supported_processors-v2-table
1513 ===================== ==========================
1514 Note Record ISA Name Target ID
1515 ===================== ==========================
1516 ``AMD:AMDGPU:6:0:0`` ``gfx600``
1517 ``AMD:AMDGPU:6:0:1`` ``gfx601``
1518 ``AMD:AMDGPU:6:0:2`` ``gfx602``
1519 ``AMD:AMDGPU:7:0:0`` ``gfx700``
1520 ``AMD:AMDGPU:7:0:1`` ``gfx701``
1521 ``AMD:AMDGPU:7:0:2`` ``gfx702``
1522 ``AMD:AMDGPU:7:0:3`` ``gfx703``
1523 ``AMD:AMDGPU:7:0:4`` ``gfx704``
1524 ``AMD:AMDGPU:7:0:5`` ``gfx705``
1525 ``AMD:AMDGPU:8:0:0`` ``gfx802``
1526 ``AMD:AMDGPU:8:0:1`` ``gfx801:xnack+``
1527 ``AMD:AMDGPU:8:0:2`` ``gfx802``
1528 ``AMD:AMDGPU:8:0:3`` ``gfx803``
1529 ``AMD:AMDGPU:8:0:4`` ``gfx803``
1530 ``AMD:AMDGPU:8:0:5`` ``gfx805``
1531 ``AMD:AMDGPU:8:1:0`` ``gfx810:xnack+``
1532 ``AMD:AMDGPU:9:0:0`` ``gfx900:xnack-``
1533 ``AMD:AMDGPU:9:0:1`` ``gfx900:xnack+``
1534 ``AMD:AMDGPU:9:0:2`` ``gfx902:xnack-``
1535 ``AMD:AMDGPU:9:0:3`` ``gfx902:xnack+``
1536 ``AMD:AMDGPU:9:0:4`` ``gfx904:xnack-``
1537 ``AMD:AMDGPU:9:0:5`` ``gfx904:xnack+``
1538 ``AMD:AMDGPU:9:0:6`` ``gfx906:sramecc-:xnack-``
1539 ``AMD:AMDGPU:9:0:7`` ``gfx906:sramecc-:xnack+``
1540 ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
1541 ===================== ==========================
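
The note record ISA name is described above as the vendor name, architecture
name, major, minor, and stepping joined by ":". The following Python sketch
shows that composition and a lookup into an excerpt of the table above. The
helper names and the ``V2_TARGET_IDS`` dictionary are hypothetical and cover
only a few rows.

.. code-block:: python

  def isa_name(vendor: str, arch: str, major: int, minor: int, stepping: int) -> str:
      """Join the five components with ':' as used in the note record ISA name."""
      return ":".join([vendor, arch, str(major), str(minor), str(stepping)])


  def split_isa_name(name: str) -> tuple:
      """Inverse of isa_name(); the version components are returned as ints."""
      vendor, arch, major, minor, stepping = name.split(":")
      return vendor, arch, int(major), int(minor), int(stepping)


  # Excerpt of the supported processors table for code object V2.
  V2_TARGET_IDS = {
      "AMD:AMDGPU:8:0:3": "gfx803",
      "AMD:AMDGPU:9:0:0": "gfx900:xnack-",
      "AMD:AMDGPU:9:0:1": "gfx900:xnack+",
      "AMD:AMDGPU:9:0:6": "gfx906:sramecc-:xnack-",
  }


  def target_id_for(major: int, minor: int, stepping: int) -> str:
      """Look up the fixed target ID for an AMD:AMDGPU ISA version."""
      return V2_TARGET_IDS[isa_name("AMD", "AMDGPU", major, minor, stepping)]

Note that the stepping alone distinguishes the ``xnack`` setting for some
processors, such as ``9:0:0`` and ``9:0:1``.
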
1543 .. _amdgpu-note-records-v3-onwards:
1545 Code Object V3 and Above Note Records
1546 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1548 The AMDGPU backend code object uses the following ELF note record in the
1549 ``.note`` section when compiling for code object V3 and above.
1551 The note record vendor field is "AMDGPU".
1553 Additional note records may be present, but any which are not documented here
1554 are deprecated and should not be used.
1556 .. table:: AMDGPU Code Object V3 and Above ELF Note Records
1557 :name: amdgpu-elf-note-records-table-v3-onwards
1559 ======== ============================== ======================================
1560 Name Type Description
1561 ======== ============================== ======================================
1562 "AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_
1564 ======== ============================== ======================================
1568 .. table:: AMDGPU Code Object V3 and Above ELF Note Record Enumeration Values
1569 :name: amdgpu-elf-note-record-enumeration-values-table-v3-onwards
1571 ============================== =====
1573 ============================== =====
1575 ``NT_AMDGPU_METADATA`` 32
1576 ============================== =====
1578 ``NT_AMDGPU_METADATA``
1579 Specifies extensible metadata associated with an AMDGPU code object. It is
1580 encoded as a map in the Message Pack [MsgPack]_ binary data format. See
1581 :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
1582 :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
1583 :ref:`amdgpu-amdhsa-code-object-metadata-v5` for the map keys defined for the
1591 Symbols include the following:
1593 .. table:: AMDGPU ELF Symbols
1594 :name: amdgpu-elf-symbols-table
1596 ===================== ================== ================ ==================
1597 Name Type Section Description
1598 ===================== ================== ================ ==================
1599 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
1602 *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
1603 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
1604 *link-name* ``STT_OBJECT`` - SHN_AMDGPU_LDS Global variable in LDS
1605 ===================== ================== ================ ==================
1608 Global variables both used and defined by the compilation unit.
1610 If the symbol is defined in the compilation unit then it is allocated in the
appropriate section according to whether it has initialized data or is read-only.
If the symbol is external then its section is ``SHN_UNDEF`` and the loader
1614 will resolve relocations using the definition provided by another code object
1615 or explicitly defined by the runtime.
1617 If the symbol resides in local/group memory (LDS) then its section is the
1618 special processor specific section name ``SHN_AMDGPU_LDS``, and the
1619 ``st_value`` field describes alignment requirements as it does for common
1624 Add description of linked shared object symbols. Seems undefined symbols
1625 are marked as STT_NOTYPE.
1628 Every HSA kernel has an associated kernel descriptor. It is the address of the
1629 kernel descriptor that is used in the AQL dispatch packet used to invoke the
1630 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1631 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
1634 Every HSA kernel also has a symbol for its machine code entry point.
1636 .. _amdgpu-relocation-records:
The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
1642 relocatable fields are:
1645 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
1646 alignment. These values use the same byte order as other word values in the
1647 AMDGPU architecture.
1650 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
1651 alignment. These values use the same byte order as other word values in the
1652 AMDGPU architecture.
The following notations are used for specifying relocation calculations:
1657 Represents the addend used to compute the value of the relocatable field.
1660 Represents the offset into the global offset table at which the relocation
1661 entry's symbol will reside during execution.
1664 Represents the address of the global offset table.
Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
1668 of the storage unit being relocated (computed using ``r_offset``).
1671 Represents the value of the symbol whose index resides in the relocation
1672 entry. Relocations not using this must specify a symbol index of
1676 Represents the base address of a loaded executable or shared object which is
1677 the difference between the ELF address and the actual load address.
1678 Relocations using this are only valid in executable or shared objects.
1680 The following relocation types are supported:
1682 .. table:: AMDGPU ELF Relocation Records
1683 :name: amdgpu-elf-relocation-records-table
1685 ========================== ======= ===== ========== ==============================
1686 Relocation Type Kind Value Field Calculation
1687 ========================== ======= ===== ========== ==============================
1688 ``R_AMDGPU_NONE`` 0 *none* *none*
1689 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF
1691 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32
1693 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A
1695 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
1696 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
1697 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A
1699 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
1700 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
1701 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
1702 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
1703 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
1705 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
1706 ``R_AMDGPU_REL16`` Static 14 ``word16`` ((S + A - P) - 4) / 4
1707 ========================== ======= ===== ========== ==============================
1709 ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
1710 the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
1712 There is no current OS loader support for 32-bit programs and so
1713 ``R_AMDGPU_ABS32`` is not used.
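
The calculations in the relocation table can be sketched in Python as follows.
This is an illustration, not the linker's implementation: ``relocate`` is a
hypothetical helper, the GOT-based relocations are omitted, and results are
assumed to wrap to the field width in two's complement, whereas a real linker
may instead report an overflow.

.. code-block:: python

  MASK32 = 0xFFFFFFFF
  MASK64 = 0xFFFFFFFFFFFFFFFF


  def relocate(kind: str, S: int = 0, A: int = 0, P: int = 0, B: int = 0) -> int:
      """Return the value stored in the relocated field for the given type."""
      if kind == "R_AMDGPU_ABS32_LO":
          return (S + A) & MASK32
      if kind == "R_AMDGPU_ABS32_HI":
          return ((S + A) & MASK64) >> 32
      if kind == "R_AMDGPU_ABS64":
          return (S + A) & MASK64
      if kind in ("R_AMDGPU_REL32", "R_AMDGPU_REL32_LO"):
          return (S + A - P) & MASK32
      if kind == "R_AMDGPU_REL32_HI":
          return ((S + A - P) & MASK64) >> 32
      if kind == "R_AMDGPU_REL64":
          return (S + A - P) & MASK64
      if kind == "R_AMDGPU_RELATIVE64":
          return (B + A) & MASK64
      if kind == "R_AMDGPU_REL16":
          # Offset relative to the end of the 4-byte instruction, counted in
          # 4-byte units, stored in a 16-bit field.
          return (((S + A - P) - 4) // 4) & 0xFFFF
      raise ValueError(f"unsupported relocation type: {kind}")

Splitting a 64-bit value across an ``_LO`` and ``_HI`` pair lets two 32-bit
fields carry one address or PC-relative offset.
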
1715 .. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
1717 Loaded Code Object Path Uniform Resource Identifier (URI)
1718 ---------------------------------------------------------
1720 The AMD GPU code object loader represents the path of the ELF shared object from
1721 which the code object was loaded as a textual Uniform Resource Identifier (URI).
1722 Note that the code object is the in memory loaded relocated form of the ELF
1723 shared object. Multiple code objects may be loaded at different memory
1724 addresses in the same process from the same ELF shared object.
1726 The loaded code object path URI syntax is defined by the following BNF syntax:
1730 code_object_uri ::== file_uri | memory_uri
1731 file_uri ::== "file://" file_path [ range_specifier ]
1732 memory_uri ::== "memory://" process_id range_specifier
1733 range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
1734 file_path ::== URI_ENCODED_OS_FILE_PATH
1735 process_id ::== DECIMAL_NUMBER
1736 number ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
1739 Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
1740 and octal values by "0".
1743 Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
1744 every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
encoded as two uppercase hexadecimal digits preceded by "%". Directories in
1746 the path are separated by "/".
1749 Is a 0-based byte offset to the start of the code object. For a file URI, it
1750 is from the start of the file specified by the ``file_path``, and if omitted
1751 defaults to 0. For a memory URI, it is the memory address and is required.
1754 Is the number of bytes in the code object. For a file URI, if omitted it
1755 defaults to the size of the file. It is required for a memory URI.
1758 Is the identity of the process owning the memory. For Linux it is the C
1759 unsigned integral decimal literal for the process ID (PID).
1765 file:///dir1/dir2/file1
1766 file:///dir3/dir4/file2#offset=0x2000&size=3000
1767 memory://1234#offset=0x20000&size=3000
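
The URI grammar above can be exercised with the following Python sketch. It is
an illustration rather than the loader's implementation: the function names are
hypothetical, a missing file size is represented as ``None`` to stand for "the
size of the file", and ``urllib.parse.quote`` is assumed to be a Python 3.7 or
later version, which leaves ``~`` unencoded.

.. code-block:: python

  import re
  from urllib.parse import quote, unquote

  # range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
  _RANGE = re.compile(r"[#?]offset=(\w+)&size=(\w+)$")


  def parse_number(text: str) -> int:
      """Parse a C integral literal: 0x/0X hex, leading-0 octal, else decimal."""
      if text[:2] in ("0x", "0X"):
          return int(text[2:], 16)
      if len(text) > 1 and text[0] == "0":
          return int(text[1:], 8)
      return int(text, 10)


  def file_uri(path: str) -> str:
      """Build a file URI; quote() keeps [a-zA-Z0-9/_.~-] and emits uppercase %XX."""
      return "file://" + quote(path)


  def parse_code_object_uri(uri: str) -> dict:
      """Split a loaded code object URI into its components."""
      match = _RANGE.search(uri)
      offset = size = None
      if match:
          offset = parse_number(match.group(1))
          size = parse_number(match.group(2))
          uri = uri[:match.start()]
      if uri.startswith("file://"):
          # An omitted offset defaults to 0; size None means the whole file.
          return {"kind": "file", "path": unquote(uri[len("file://"):]),
                  "offset": offset or 0, "size": size}
      if uri.startswith("memory://"):
          if match is None:
              raise ValueError("a memory URI requires offset and size")
          return {"kind": "memory", "pid": int(uri[len("memory://"):]),
                  "offset": offset, "size": size}
      raise ValueError("unsupported URI scheme")

Because every character outside the unreserved set is percent-encoded in
``file_path``, a literal "#" or "?" in a file name cannot be mistaken for the
start of the range specifier.
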
1769 .. _amdgpu-dwarf-debug-information:
1771 DWARF Debug Information
1772 =======================
1776 This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
1777 is not currently fully implemented and is subject to change.
1779 AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
1780 :ref:`amdgpu-elf-code-object`) which contain information that maps the code
1781 object executable code and data to the source language constructs. It can be
1782 used by tools such as debuggers and profilers. It uses features defined in
1783 :doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
1784 DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
1786 This section defines the AMDGPU target architecture specific DWARF mappings.
1788 .. _amdgpu-dwarf-register-identifier:
1793 This section defines the AMDGPU target architecture register numbers used in
1794 DWARF operation expressions (see DWARF Version 5 section 2.5 and
1795 :ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
1796 instructions (see DWARF Version 5 section 6.4 and
1797 :ref:`amdgpu-dwarf-call-frame-information`).
1799 A single code object can contain code for kernels that have different wavefront
1800 sizes. The vector registers and some scalar registers are based on the wavefront
1801 size. AMDGPU defines distinct DWARF registers for each wavefront size. This
1802 simplifies the consumer of the DWARF so that each register has a fixed size,
1803 rather than being dynamic according to the wavefront size mode. Similarly,
1804 distinct DWARF registers are defined for those registers that vary in size
1805 according to the process address size. This allows a consumer to treat a
1806 specific AMDGPU processor as a single architecture regardless of how it is
1807 configured at run time. The compiler explicitly specifies the DWARF registers
1808 that match the mode in which the code it is generating will be executed.
1810 DWARF registers are encoded as numbers, which are mapped to architecture
1811 registers. The mapping for AMDGPU is defined in
:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
mapping.
.. table:: AMDGPU DWARF Register Mapping
   :name: amdgpu-dwarf-register-mapping-table

   ============== ================= ======== ==================================
   DWARF Register AMDGPU Register   Bit Size Description
   ============== ================= ======== ==================================
   0              PC_32             32       Program Counter (PC) when
                                             executing in a 32-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   1              EXEC_MASK_32      32       Execution Mask Register when
                                             executing in wavefront 32 mode.
   2-15           *Reserved*                 *Reserved for highly accessed
                                             registers using DWARF shortcut.*
   16             PC_64             64       Program Counter (PC) when
                                             executing in a 64-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   17             EXEC_MASK_64      64       Execution Mask Register when
                                             executing in wavefront 64 mode.
   18-31          *Reserved*                 *Reserved for highly accessed
                                             registers using DWARF shortcut.*
   32-95          SGPR0-SGPR63      32       Scalar General Purpose
                                             Registers.
   96-127         *Reserved*                 *Reserved for frequently accessed
                                             registers using DWARF 1-byte ULEB.*
   128            STATUS            32       Status Register.
   129-511        *Reserved*                 *Reserved for future Scalar
                                             Architectural Registers.*
   512            VCC_32            32       Vector Condition Code Register
                                             when executing in wavefront 32
                                             mode.
   513-767        *Reserved*                 *Reserved for future Vector
                                             Architectural Registers when
                                             executing in wavefront 32 mode.*
   768            VCC_64            64       Vector Condition Code Register
                                             when executing in wavefront 64
                                             mode.
   769-1023       *Reserved*                 *Reserved for future Vector
                                             Architectural Registers when
                                             executing in wavefront 64 mode.*
   1024-1087      *Reserved*                 *Reserved for padding.*
   1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers.
   1130-1535      *Reserved*                 *Reserved for future Scalar
                                             General Purpose Registers.*
   1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
                                             when executing in wavefront 32
                                             mode.
   1792-2047      *Reserved*                 *Reserved for future Vector
                                             General Purpose Registers when
                                             executing in wavefront 32 mode.*
   2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
                                             when executing in wavefront 32
                                             mode.
   2304-2559      *Reserved*                 *Reserved for future Vector
                                             Accumulation Registers when
                                             executing in wavefront 32 mode.*
   2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
                                             when executing in wavefront 64
                                             mode.
   2816-3071      *Reserved*                 *Reserved for future Vector
                                             General Purpose Registers when
                                             executing in wavefront 64 mode.*
   3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
                                             when executing in wavefront 64
                                             mode.
   3328-3583      *Reserved*                 *Reserved for future Vector
                                             Accumulation Registers when
                                             executing in wavefront 64 mode.*
   ============== ================= ======== ==================================
1888 The vector registers are represented as the full size for the wavefront. They
1889 are organized as consecutive dwords (32-bits), one per lane, with the dword at
1890 the least significant bit position corresponding to lane 0 and so forth. DWARF
1891 location expressions involving the ``DW_OP_LLVM_offset`` and
1892 ``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
1893 register corresponding to the lane that is executing the current thread of
execution in languages that are implemented using a SIMD or SIMT execution
model.
1897 If the wavefront size is 32 lanes then the wavefront 32 mode register
1898 definitions are used. If the wavefront size is 64 lanes then the wavefront 64
1899 mode register definitions are used. Some AMDGPU targets support executing in
1900 both wavefront 32 and wavefront 64 mode. The register definitions corresponding
1901 to the wavefront mode of the generated code will be used.
1903 If code is generated to execute in a 32-bit process address space, then the
1904 32-bit process address space register definitions are used. If code is generated
1905 to execute in a 64-bit process address space, then the 64-bit process address
1906 space register definitions are used. The ``amdgcn`` target only supports the
1907 64-bit process address space.
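The selection rules above amount to a lookup keyed on the wavefront size and the process address size. The following Python sketch encodes the ranges from the register mapping table. The helper names ``dwarf_register`` and ``lane_byte_offset`` and the string register kinds are hypothetical and are not part of LLVM.

```python
def dwarf_register(kind, index=0, wavefront_size=64, address_bits=64):
    """Return the DWARF register number for an AMDGPU register.

    The numbers are taken from the AMDGPU DWARF register mapping table.
    """
    if wavefront_size not in (32, 64) or address_bits not in (32, 64):
        raise ValueError("unsupported wavefront size or address size")
    wave64 = wavefront_size == 64
    if kind == "PC":
        return 16 if address_bits == 64 else 0
    if kind == "EXEC":
        return 17 if wave64 else 1
    if kind == "VCC":
        return 768 if wave64 else 512
    if kind == "STATUS":
        return 128
    if kind == "SGPR":
        if 0 <= index <= 63:
            return 32 + index
        if 64 <= index <= 105:
            return 1088 + (index - 64)
    if kind == "VGPR" and 0 <= index <= 255:
        return (2560 if wave64 else 1536) + index
    if kind == "AGPR" and 0 <= index <= 255:
        return (3072 if wave64 else 2048) + index
    raise ValueError("no DWARF register for %s%d" % (kind, index))

def lane_byte_offset(lane, wavefront_size):
    """Byte offset of a lane's dword within a vector register.

    Lane 0 occupies the least significant dword, lane 1 the next, and so on.
    """
    if not 0 <= lane < wavefront_size:
        raise ValueError("lane out of range for the wavefront size")
    return lane * 4

print(dwarf_register("VGPR", 3, wavefront_size=32))  # 1539
print(dwarf_register("SGPR", 105))                   # 1129
print(lane_byte_offset(5, 64))                       # 20
```

A location expression would combine such a byte offset with ``DW_OP_LLVM_push_lane`` and ``DW_OP_LLVM_offset`` to pick out the dword for the focused lane.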
1909 .. _amdgpu-dwarf-memory-space-identifier:
1911 Memory Space Identifier
1912 -----------------------
1914 The DWARF memory space represents the source language memory space. See DWARF
1915 Version 5 section 2.12 which is updated by the *DWARF Extensions For
1916 Heterogeneous Debugging* section :ref:`amdgpu-dwarf-memory-spaces`.
1918 The DWARF memory space mapping used for AMDGPU is defined in
1919 :ref:`amdgpu-dwarf-memory-space-mapping-table`.
1921 .. table:: AMDGPU DWARF Memory Space Mapping
1922 :name: amdgpu-dwarf-memory-space-mapping-table
1924 =========================== ====== =================
1926 ---------------------------------- -----------------
1927 Memory Space Name Value Memory Space
1928 =========================== ====== =================
1929 ``DW_MSPACE_LLVM_none`` 0x0000 Generic (Flat)
1930 ``DW_MSPACE_LLVM_global`` 0x0001 Global
1931 ``DW_MSPACE_LLVM_constant`` 0x0002 Global
1932 ``DW_MSPACE_LLVM_group`` 0x0003 Local (group/LDS)
1933 ``DW_MSPACE_LLVM_private`` 0x0004 Private (Scratch)
1934 ``DW_MSPACE_AMDGPU_region`` 0x8000 Region (GDS)
1935 =========================== ====== =================
1937 The DWARF memory space values defined in the *DWARF Extensions For Heterogeneous
1938 Debugging* section :ref:`amdgpu-dwarf-memory-spaces` are used.
In addition, ``DW_MSPACE_AMDGPU_region`` is encoded as a vendor extension. It is
used by the AMD extension that provides access to the hardware GDS memory, which
is scratchpad memory allocated per device.
1944 For AMDGPU if no ``DW_AT_LLVM_memory_space`` attribute is present, then the
1945 default memory space of ``DW_MSPACE_LLVM_none`` is used.
1947 See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
mapping of DWARF memory spaces to DWARF address spaces, including address size
and NULL value.
1951 .. _amdgpu-dwarf-address-space-identifier:
1953 Address Space Identifier
1954 ------------------------
1956 DWARF address spaces correspond to target architecture specific linear
1957 addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
1958 For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-address-spaces`.
1960 The DWARF address space mapping used for AMDGPU is defined in
1961 :ref:`amdgpu-dwarf-address-space-mapping-table`.
1963 .. table:: AMDGPU DWARF Address Space Mapping
1964 :name: amdgpu-dwarf-address-space-mapping-table
1966 ======================================= ===== ======= ======== ===================== =======================
1968 --------------------------------------- ----- ---------------- --------------------- -----------------------
1969 Address Space Name Value Address Bit Size LLVM IR Address Space
1970 --------------------------------------- ----- ------- -------- --------------------- -----------------------
1975 ======================================= ===== ======= ======== ===================== =======================
1976 ``DW_ASPACE_LLVM_none`` 0x00 64 32 Global *default address space*
1977 ``DW_ASPACE_AMDGPU_generic`` 0x01 64 32 Generic (Flat)
1978 ``DW_ASPACE_AMDGPU_region`` 0x02 32 32 Region (GDS)
1979 ``DW_ASPACE_AMDGPU_local`` 0x03 32 32 Local (group/LDS)
1981 ``DW_ASPACE_AMDGPU_private_lane`` 0x05 32 32 Private (Scratch) *focused lane*
1982 ``DW_ASPACE_AMDGPU_private_wave`` 0x06 32 32 Private (Scratch) *unswizzled wavefront*
1983 ======================================= ===== ======= ======== ===================== =======================
1985 See :ref:`amdgpu-address-spaces` for information on the AMDGPU LLVM IR address
1986 spaces including address size and NULL value.
1988 The ``DW_ASPACE_LLVM_none`` address space is the default target architecture
1989 address space used in DWARF operations that do not specify an address space. It
1990 therefore has to map to the global address space so that the ``DW_OP_addr*`` and
1991 related operations can refer to addresses in the program code.
1993 The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
1994 specify the flat address space. If the address corresponds to an address in the
1995 local address space, then it corresponds to the wavefront that is executing the
1996 focused thread of execution. If the address corresponds to an address in the
1997 private address space, then it corresponds to the lane that is executing the
1998 focused thread of execution for languages that are implemented using a SIMD or
1999 SIMT execution model.
2003 CUDA-like languages such as HIP that do not have address spaces in the
2004 language type system, but do allow variables to be allocated in different
2005 address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
2006 address space in the DWARF expression operations as the default address space
2007 is the global address space.
2009 The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
2010 specify the local address space corresponding to the wavefront that is executing
2011 the focused thread of execution.
2013 The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
2014 to specify the private address space corresponding to the lane that is executing
2015 the focused thread of execution for languages that are implemented using a SIMD
2016 or SIMT execution model.
2018 The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
2019 to specify the unswizzled private address space corresponding to the wavefront
2020 that is executing the focused thread of execution. The wavefront view of private
2021 memory is the per wavefront unswizzled backing memory layout defined in
2022 :ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
2023 location for the backing memory of the wavefront (namely the address is not
2024 offset by ``wavefront-scratch-base``). The following formula can be used to
2025 convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
2026 ``DW_ASPACE_AMDGPU_private_wave`` address:
2030 private-address-wavefront =
2031 ((private-address-lane / 4) * wavefront-size * 4) +
2032 (wavefront-lane-id * 4) + (private-address-lane % 4)
2034 If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
of the dwords for each lane starting with lane 0 is required, then this
simplifies to:
2040 private-address-wavefront =
2041 private-address-lane * wavefront-size
2043 A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
2044 complete spilled vector register back into a complete vector register in the
2045 CFI. The frame pointer can be a private lane address which is dword aligned,
2046 which can be shifted to multiply by the wavefront size, and then used to form a
2047 private wavefront address that gives a location for a contiguous set of dwords,
2048 one per lane, where the vector register dwords are spilled. The compiler knows
2049 the wavefront size since it generates the code. Note that the type of the
2050 address may have to be converted as the size of a
2051 ``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
2052 ``DW_ASPACE_AMDGPU_private_wave`` address.
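The two conversion formulas above can be checked numerically. The following Python sketch transcribes them directly. The function names are hypothetical, and addresses are plain integers rather than typed DWARF values.

```python
def private_lane_to_wave(private_address_lane, wavefront_lane_id, wavefront_size):
    """Convert a private_lane address to the unswizzled private_wave address."""
    dword_index, byte_in_dword = divmod(private_address_lane, 4)
    return (dword_index * wavefront_size * 4
            + wavefront_lane_id * 4
            + byte_in_dword)

def private_lane_to_wave_base(private_address_lane, wavefront_size):
    """Simplified form: dword-aligned address, start of the lane 0 dword."""
    if private_address_lane % 4 != 0:
        raise ValueError("the simplified form requires a dword-aligned address")
    return private_address_lane * wavefront_size

# For a dword-aligned address and lane 0 both forms agree.
print(private_lane_to_wave(8, 0, 64))     # 512
print(private_lane_to_wave_base(8, 64))   # 512
# Byte 1 of dword 1 for lane 3 in a 64-lane wavefront.
print(private_lane_to_wave(5, 3, 64))     # 269
```

Because the wavefront size is a power of two, the multiplication in the simplified form can be performed as a shift, which is what the CFI use case described above relies on.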
2054 .. _amdgpu-dwarf-lane-identifier:
DWARF lane identifiers specify a target architecture lane position for hardware
2060 that executes in a SIMD or SIMT manner, and on which a source language maps its
2061 threads of execution onto those lanes. The DWARF lane identifier is pushed by
2062 the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
2063 section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
2064 section :ref:`amdgpu-dwarf-operation-expressions`.
2066 For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
2067 wavefront. It is numbered from 0 to the wavefront size minus 1.
2069 Operation Expressions
2070 ---------------------
2072 DWARF expressions are used to compute program values and the locations of
2073 program objects. See DWARF Version 5 section 2.5 and
2074 :ref:`amdgpu-dwarf-operation-expressions`.
2076 DWARF location descriptions describe how to access storage which includes memory
2077 and registers. When accessing storage on AMDGPU, bytes are ordered with least
2078 significant bytes first, and bits are ordered within bytes with least
2079 significant bits first.
2081 For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
2082 unwinding vector registers that are spilled under the execution mask to memory:
2083 the zero-single location description is the vector register, and the one-single
2084 location description is the spilled memory location description. The
2085 ``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
2086 memory location description.
2088 In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
2089 ``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
2090 controlled by the execution mask. An undefined location description together
2091 with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
2092 to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
2094 Debugger Information Entry Attributes
2095 -------------------------------------
2097 This section describes how certain debugger information entry attributes are
used by AMDGPU. See DWARF Version 5 sections 3.3.5 and 3.1.1
2099 which are updated by *DWARF Extensions For Heterogeneous Debugging* section
2100 :ref:`amdgpu-dwarf-low-level-information` and
2101 :ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.
2103 .. _amdgpu-dwarf-dw-at-llvm-lane-pc:
2105 ``DW_AT_LLVM_lane_pc``
2106 ~~~~~~~~~~~~~~~~~~~~~~
2108 For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
2109 location of the separate lanes of a SIMT thread.
If the lane is an active lane then this will be the same as the current program
location.
2114 If the lane is inactive, but was active on entry to the subprogram, then this is
2115 the program location in the subprogram at which execution of the lane is
conceptually positioned.
2118 If the lane was not active on entry to the subprogram, then this will be the
2119 undefined location. A client debugger can check if the lane is part of a valid
2120 work-group by checking that the lane is in the range of the associated
2121 work-group within the grid, accounting for partial work-groups. If it is not,
2122 then the debugger can omit any information for the lane. Otherwise, the debugger
2123 may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
2124 calling subprogram until it finds a non-undefined location. Conceptually the
2125 lane only has the call frames that it has a non-undefined
2126 ``DW_AT_LLVM_lane_pc``.
2128 The following example illustrates how the AMDGPU backend can generate a DWARF
2129 location list expression for the nested ``IF/THEN/ELSE`` structures of the
2130 following subprogram pseudo code for a target with 64 lanes per wavefront.
2152 The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
2153 execution mask (``EXEC``) to linearize the control flow. The condition is
2154 evaluated to make a mask of the lanes for which the condition evaluates to true.
2155 First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
2156 logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by setting the ``EXEC`` mask to the logical ``AND``
of the negated ``EXEC`` mask with the ``EXEC`` mask saved at the start of the
region. After the ``IF/THEN/ELSE``
2159 region the ``EXEC`` mask is restored to the value it had at the beginning of the
2160 region. This is shown below. Other approaches are possible, but the basic
2161 concept is the same.
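The mask arithmetic described in this paragraph can be modeled with ordinary integer operations. The following Python sketch is only an illustration of the bit manipulation for a 64-lane wavefront. The function name ``if_then_else_masks`` is hypothetical and the code is not what the backend emits.

```python
WAVE_MASK = (1 << 64) - 1  # one bit per lane of a 64-lane wavefront

def if_then_else_masks(exec_mask, cond_mask):
    """Return the EXEC masks used for the THEN region, the ELSE region,
    and after the IF/THEN/ELSE region."""
    saved_exec = exec_mask                       # saved at the start of the region
    then_exec = saved_exec & cond_mask           # lanes where the condition is true
    else_exec = ~then_exec & saved_exec & WAVE_MASK  # remaining entry lanes
    restored_exec = saved_exec                   # restored at the end of the region
    return then_exec, else_exec, restored_exec

then_exec, else_exec, restored = if_then_else_masks(0b1111, 0b0101)
print(bin(then_exec), bin(else_exec), bin(restored))  # 0b101 0b1010 0b1111
```

Every lane that was active on entry executes exactly one of the two regions, so the ``THEN`` and ``ELSE`` masks are disjoint and their union is the saved mask.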
2194 To create the DWARF location list expression that defines the location
2195 description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
2196 pseudo instruction can be used to annotate the linearized control flow. This can
2197 be done by defining an artificial variable for the lane PC. The DWARF location
2198 list expression created for it is used as the value of the
2199 ``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
2201 A DWARF procedure is defined for each well nested structured control flow region
2202 which provides the conceptual lane program location for a lane if it is not
2203 active (namely it is divergent). The DWARF operation expression for each region
2204 conceptually inherits the value of the immediately enclosing region and modifies
2205 it according to the semantics of the region.
2207 For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
2208 the region for the ``THEN`` region since it is executed first. For the ``ELSE``
2209 region the divergent program location is at the end of the ``IF/THEN/ELSE``
2210 region since the ``THEN`` region has completed.
2212 The lane PC artificial variable is assigned at each region transition. It uses
2213 the immediately enclosing region's DWARF procedure to compute the program
2214 location for each lane assuming they are divergent, and then modifies the result
2215 by inserting the current program location for each lane that the ``EXEC`` mask
2216 indicates is active.
2218 By having separate DWARF procedures for each region, they can be reused to
2219 define the value for any nested region. This reduces the total size of the DWARF
2220 operation expressions.
2222 The following provides an example using pseudo LLVM MIR.
2228 DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
2229 DW_AT_name = "__uint64";
2230 DW_AT_byte_size = 8;
2231 DW_AT_encoding = DW_ATE_unsigned;
2233 DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
2234 DW_AT_name = "__active_lane_pc";
2237 DW_OP_LLVM_extend 64, 64;
DW_OP_regval_type EXEC, %__uint_64;
2239 DW_OP_LLVM_select_bit_piece 64, 64;
2242 DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
2243 DW_AT_name = "__divergent_lane_pc";
2245 DW_OP_LLVM_undefined;
2246 DW_OP_LLVM_extend 64, 64;
2249 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2250 DW_OP_call_ref %__divergent_lane_pc;
2251 DW_OP_call_ref %__active_lane_pc;
2255 DBG_VALUE %1, $noreg, %__lex_1_save_exec;
2260 DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
2261 DW_AT_name = "__divergent_lane_pc_1_then";
2262 DW_AT_location = DIExpression[
2263 DW_OP_call_ref %__divergent_lane_pc;
2264 DW_OP_addrx &lex_1_start;
2266 DW_OP_LLVM_extend 64, 64;
2267 DW_OP_call_ref %__lex_1_save_exec;
2268 DW_OP_deref_type 64, %__uint_64;
2269 DW_OP_LLVM_select_bit_piece 64, 64;
2272 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2273 DW_OP_call_ref %__divergent_lane_pc_1_then;
2274 DW_OP_call_ref %__active_lane_pc;
DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
2283 DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
2284 DW_AT_name = "__divergent_lane_pc_1_1_then";
2285 DW_AT_location = DIExpression[
2286 DW_OP_call_ref %__divergent_lane_pc_1_then;
2287 DW_OP_addrx &lex_1_1_start;
2289 DW_OP_LLVM_extend 64, 64;
2290 DW_OP_call_ref %__lex_1_1_save_exec;
2291 DW_OP_deref_type 64, %__uint_64;
2292 DW_OP_LLVM_select_bit_piece 64, 64;
2295 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2296 DW_OP_call_ref %__divergent_lane_pc_1_1_then;
2297 DW_OP_call_ref %__active_lane_pc;
2302 DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
2303 DW_AT_name = "__divergent_lane_pc_1_1_else";
2304 DW_AT_location = DIExpression[
2305 DW_OP_call_ref %__divergent_lane_pc_1_then;
2306 DW_OP_addrx &lex_1_1_end;
2308 DW_OP_LLVM_extend 64, 64;
2309 DW_OP_call_ref %__lex_1_1_save_exec;
2310 DW_OP_deref_type 64, %__uint_64;
2311 DW_OP_LLVM_select_bit_piece 64, 64;
2314 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2315 DW_OP_call_ref %__divergent_lane_pc_1_1_else;
2316 DW_OP_call_ref %__active_lane_pc;
2321 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2322 DW_OP_call_ref %__divergent_lane_pc;
2323 DW_OP_call_ref %__active_lane_pc;
2328 DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
2329 DW_AT_name = "__divergent_lane_pc_1_else";
2330 DW_AT_location = DIExpression[
2331 DW_OP_call_ref %__divergent_lane_pc;
2332 DW_OP_addrx &lex_1_end;
2334 DW_OP_LLVM_extend 64, 64;
2335 DW_OP_call_ref %__lex_1_save_exec;
2336 DW_OP_deref_type 64, %__uint_64;
2337 DW_OP_LLVM_select_bit_piece 64, 64;
2340 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2341 DW_OP_call_ref %__divergent_lane_pc_1_else;
2342 DW_OP_call_ref %__active_lane_pc;
DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2348 DW_OP_call_ref %__divergent_lane_pc;
2349 DW_OP_call_ref %__active_lane_pc;
2354 The DWARF procedure ``%__active_lane_pc`` is used to update the lane pc elements
2355 that are active, with the current program location.
Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo
2359 instruction, location list entries will be created that describe where the
2360 artificial variables are allocated at any given program location. The compiler
2361 may allocate them to registers or spill them to memory.
2363 The DWARF procedures for each region use the values of the saved execution mask
2364 artificial variables to only update the lanes that are active on entry to the
2365 region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.
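The way the region procedures compose can be modeled outside of DWARF. The following Python sketch treats the lane PC as a list with one entry per lane and models ``DW_OP_LLVM_select_bit_piece`` as a per-lane choice driven by a mask. The names ``select_lanes`` and ``lane_pcs`` and the numeric program locations are hypothetical, and ``None`` stands in for the undefined location description.

```python
UNDEFINED = None  # stands in for the undefined location description

def select_lanes(zero_values, one_values, mask):
    """Per lane, pick from one_values if the mask bit is set, else zero_values."""
    return [one if (mask >> lane) & 1 else zero
            for lane, (zero, one) in enumerate(zip(zero_values, one_values))]

def lane_pcs(current_pc, exec_mask, regions, lanes=64):
    """Compute the conceptual program location of every lane.

    regions lists the enclosing regions outermost first as
    (divergent_pc, saved_exec_mask) pairs.
    """
    pcs = [UNDEFINED] * lanes                      # like %__divergent_lane_pc
    for divergent_pc, saved_exec in regions:       # one procedure per region
        pcs = select_lanes(pcs, [divergent_pc] * lanes, saved_exec)
    # Like %__active_lane_pc: active lanes are at the current location.
    return select_lanes(pcs, [current_pc] * lanes, exec_mask)

# Four lanes for brevity. Lane 3 was not active on entry to the subprogram.
# Inside the outer THEN region (starts at 100) and the nested THEN region
# (starts at 120), with only lane 0 still active at location 125.
print(lane_pcs(125, 0b0001, [(100, 0b0111), (120, 0b0011)], lanes=4))
# [125, 120, 100, None]
```

Each region only overwrites the lanes recorded in its saved execution mask, so lanes that diverged in an outer region keep the location assigned by that outer region.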
2369 Other structured control flow regions can be handled similarly. For example,
2370 loops would set the divergent program location for the region at the end of the
2371 loop. Any lanes active will be in the loop, and any lanes not active must have
2374 An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
2375 ``IF/THEN/ELSE`` regions.
2377 The DWARF procedures can use the active lane artificial variable described in
2378 :ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
2379 ``EXEC`` mask in order to support whole or quad wavefront mode.
2381 .. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
2383 ``DW_AT_LLVM_active_lane``
2384 ~~~~~~~~~~~~~~~~~~~~~~~~~~
2386 The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
2387 entry is used to specify the lanes that are conceptually active for a SIMT
2390 The execution mask may be modified to implement whole or quad wavefront mode
2391 operations. For example, all lanes may need to temporarily be made active to
2392 execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
2393 update it to enable the necessary lanes, perform the operations, and then
2394 restore the ``EXEC`` mask from the saved value. While executing the whole
2395 wavefront region, the conceptual execution mask is the saved value, not the
2398 This is handled by defining an artificial variable for the active lane mask. The
2399 active lane mask artificial variable would be the actual ``EXEC`` mask for
2400 normal regions, and the saved execution mask for regions where the mask is
2401 temporarily updated. The location list expression created for this artificial
variable is used to define the value of the ``DW_AT_LLVM_active_lane``
attribute.
2405 ``DW_AT_LLVM_augmentation``
2406 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
2408 For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
2409 debugger information entry has the following value for the augmentation string:
2415 The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2416 extensions used in the DWARF of the compilation unit. The version number
2417 conforms to [SEMVER]_.
2419 Call Frame Information
2420 ----------------------
2422 DWARF Call Frame Information (CFI) describes how a consumer can virtually
2423 *unwind* call frames in a running process or core dump. See DWARF Version 5
2424 section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
2426 For AMDGPU, the Common Information Entry (CIE) fields have the following values:
2428 1. ``augmentation`` string contains the following null-terminated UTF-8 string:
2434 The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
extensions used in this CIE or in the FDEs that use it. The version number
2436 conforms to [SEMVER]_.
2438 2. ``address_size`` for the ``Global`` address space is defined in
2439 :ref:`amdgpu-dwarf-address-space-identifier`.
2441 3. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.
2443 4. ``code_alignment_factor`` is 4 bytes.
2447 Add to :ref:`amdgpu-processor-table` table.
2449 5. ``data_alignment_factor`` is 4 bytes.
2453 Add to :ref:`amdgpu-processor-table` table.
2455 6. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
2456 for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.
2458 7. ``initial_instructions`` Since a subprogram X with fewer registers can be
2459 called from subprogram Y that has more allocated, X will not change any of
2460 the extra registers as it cannot access them. Therefore, the default rule
2461 for all columns is ``same value``.
2463 For AMDGPU the register number follows the numbering defined in
2464 :ref:`amdgpu-dwarf-register-identifier`.
2466 For AMDGPU the instructions are variable size. A consumer can subtract 1 from
2467 the return address to get the address of a byte within the call site
2468 instructions. See DWARF Version 5 section 6.4.4.
2473 See DWARF Version 5 section 6.1.
2475 Lookup By Name Section Header
2476 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2478 See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
For AMDGPU the lookup by name section header fields have the following values:
2482 ``augmentation_string_size`` (uword)
Set to the length of the ``augmentation_string`` value which is always a
multiple of 4.
2487 ``augmentation_string`` (sequence of UTF-8 characters)
2489 Contains the following UTF-8 string null padded to a multiple of 4 bytes:
2495 The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
extensions used in the DWARF of this index. The version number conforms to
[SEMVER]_.
2501 This is different to the DWARF Version 5 definition that requires the first
2502 4 characters to be the vendor ID. But this is consistent with the other
2503 augmentation strings and does allow multiple vendor contributions. However,
2504 backwards compatibility may be more desirable.
2506 Lookup By Address Section Header
2507 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2509 See DWARF Version 5 section 6.1.2.
For AMDGPU the lookup by address section header fields have the following
values:
2513 ``address_size`` (ubyte)
2515 Match the address size for the ``Global`` address space defined in
2516 :ref:`amdgpu-dwarf-address-space-identifier`.
2518 ``segment_selector_size`` (ubyte)
2520 AMDGPU does not use a segment selector so this is 0. The entries in the
2521 ``.debug_aranges`` do not have a segment selector.
2523 Line Number Information
2524 -----------------------
2526 See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.
AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
2529 The instruction set must be obtained from the ELF file header ``e_flags`` field
2530 in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
2531 <amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
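Since the line table does not carry the instruction set, a consumer has to read it from the ELF header. The following Python sketch extracts the ``EF_AMDGPU_MACH`` field from the raw bytes of an ELF header. The function name ``amdgpu_mach`` is hypothetical, and the mask value ``0x0ff`` is an assumption that should be checked against the ELF header section of this guide.

```python
import struct

EF_AMDGPU_MACH = 0x0FF  # assumed bit mask of the machine field in e_flags

def amdgpu_mach(elf_header):
    """Return the EF_AMDGPU_MACH value from the bytes of an ELF header."""
    if elf_header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64_bit = elf_header[4] == 2          # EI_CLASS == ELFCLASS64
    little_endian = elf_header[5] == 1      # EI_DATA == ELFDATA2LSB
    # e_flags follows e_shoff: offset 48 in an ELF64 header, 36 in ELF32.
    offset = 48 if is_64_bit else 36
    fmt = "<I" if little_endian else ">I"
    (e_flags,) = struct.unpack_from(fmt, elf_header, offset)
    return e_flags & EF_AMDGPU_MACH

# A synthetic 64-bit little-endian header with e_flags = 0x12c.
header = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9) + bytes(32) + struct.pack("<I", 0x12C)
print(hex(amdgpu_mach(header)))  # 0x2c
```

The resulting value identifies the processor and would be looked up in the ``EF_AMDGPU_MACH`` table of the ELF header section.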
2535 Should the ``isa`` state machine register be used to indicate if the code is
2536 in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?
2538 For AMDGPU the line number program header fields have the following values (see
2539 DWARF Version 5 section 6.2.4):
2541 ``address_size`` (ubyte)
2542 Matches the address size for the ``Global`` address space defined in
2543 :ref:`amdgpu-dwarf-address-space-identifier`.
2545 ``segment_selector_size`` (ubyte)
2546 AMDGPU does not use a segment selector so this is 0.
2548 ``minimum_instruction_length`` (ubyte)
2549 For GFX9-GFX11 this is 4.
2551 ``maximum_operations_per_instruction`` (ubyte)
2552 For GFX9-GFX11 this is 1.
2554 Source text for online-compiled programs (for example, those compiled by the
2555 OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
2556 See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
2557 Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
2558 <amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.
2560 The Clang option used to control source embedding in AMDGPU is defined in
2561 :ref:`amdgpu-clang-debug-options-table`.
2563 .. table:: AMDGPU Clang Debug Options
2564 :name: amdgpu-clang-debug-options-table
2566 ==================== ==================================================
2567 Debug Flag Description
2568 ==================== ==================================================
2569 -g[no-]embed-source Enable/disable embedding source text in DWARF
2570 debug sections. Useful for environments where
2571 source cannot be written to disk, such as
2572 when performing online compilation.
2573 ==================== ==================================================
2578 Enable the embedded source.
2580 ``-gno-embed-source``
2581 Disable the embedded source.
2583 32-Bit and 64-Bit DWARF Formats
2584 -------------------------------
2586 See DWARF Version 5 section 7.4 and
2587 :ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.
* For the ``amdgcn`` target architecture only the 64-bit process address space
  is supported.
2594 * The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
2595 the 32-bit DWARF format.
2600 For AMDGPU the following values apply for each of the unit headers described in
2601 DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:
2603 ``address_size`` (ubyte)
2604 Matches the address size for the ``Global`` address space defined in
2605 :ref:`amdgpu-dwarf-address-space-identifier`.
2607 .. _amdgpu-code-conventions:
2612 This section provides code conventions used for each supported target triple OS
2613 (see :ref:`amdgpu-target-triples`).
2618 This section provides code conventions used when the target triple OS is
2619 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
2621 .. _amdgpu-amdhsa-code-object-metadata:
2623 Code Object Metadata
2624 ~~~~~~~~~~~~~~~~~~~~
2626 The code object metadata specifies extensible metadata associated with the code
2627 objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
2628 encoding and semantics of this metadata depend on the code object version; see
2629 :ref:`amdgpu-amdhsa-code-object-metadata-v2`,
2630 :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
2631 :ref:`amdgpu-amdhsa-code-object-metadata-v4` and
2632 :ref:`amdgpu-amdhsa-code-object-metadata-v5`.
2634 Code object metadata is specified in a note record (see
2635 :ref:`amdgpu-note-records`) and is required when the target triple OS is
2636 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
2637 information necessary to support the HSA compatible runtime kernel queries, for
2638 example the segment sizes needed in a dispatch packet. In addition, a
2639 high-level language runtime may require other information to be included. For
2640 example, the AMD OpenCL runtime records kernel argument information.
2642 .. _amdgpu-amdhsa-code-object-metadata-v2:
2644 Code Object V2 Metadata
2645 +++++++++++++++++++++++
2648 Code object V2 is not the default code object version emitted by this version
2651 Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
2652 (see :ref:`amdgpu-note-records-v2`).
2654 The metadata is specified as a YAML formatted string (see [YAML]_ and
2659 Is the string null terminated? It probably should not if YAML allows it to
2660 contain null characters, otherwise it should be.
2662 The metadata is represented as a single YAML document comprised of the mapping
2663 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
2666 For boolean values, the string values of ``false`` and ``true`` are used for
2667 false and true respectively.
2669 Additional information can be added to the mappings. To avoid conflicts, any
2670 non-AMD key names should be prefixed by "*vendor-name*.".
2672 .. table:: AMDHSA Code Object V2 Metadata Map
2673 :name: amdgpu-amdhsa-code-object-metadata-map-v2-table
2675 ========== ============== ========= =======================================
2676 String Key Value Type Required? Description
2677 ========== ============== ========= =======================================
2678 "Version" sequence of Required - The first integer is the major
2679 2 integers version. Currently 1.
2680 - The second integer is the minor
2681 version. Currently 0.
2682 "Printf" sequence of Each string is encoded information
2683 strings about a printf function call. The
2684 encoded information is organized as
2685 fields separated by colon (':'):
2687 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
2692 A 32-bit integer as a unique id for
2693 each printf function call
2696 A 32-bit integer equal to the number
2697 of arguments of printf function call
2700 ``S[i]`` (where i = 0, 1, ... , N-1)
2701 32-bit integers for the size in bytes
2702 of the i-th FormatString argument of
2703 the printf function call
2706 The format string passed to the
2707 printf function call.
2708 "Kernels" sequence of Required Sequence of the mappings for each
2709 mapping kernel in the code object. See
2710 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
2711 for the definition of the mapping.
2712 ========== ============== ========= =======================================
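The ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` encoding described above can be
decoded with a bounded split. The following is a minimal sketch, not the
runtime's implementation: the names ``PrintfRecord`` and ``parse_printf_record``
are hypothetical, and it assumes that everything after the ``N`` size fields
belongs to the format string, so that colons inside the format string are
preserved.

```python
from typing import NamedTuple


class PrintfRecord(NamedTuple):
    """Decoded fields of one printf metadata string (hypothetical helper type)."""
    printf_id: int
    arg_sizes: list
    format_string: str


def parse_printf_record(record: str) -> PrintfRecord:
    # Split off ID and N; the rest of the string is kept intact for now.
    id_field, count_field, rest = record.split(":", 2)
    n = int(count_field)
    # Split off exactly N size fields. The remainder is the format string,
    # which may itself contain ':' characters.
    fields = rest.split(":", n)
    if len(fields) != n + 1:
        raise ValueError(f"expected {n} argument sizes in {record!r}")
    sizes = [int(s) for s in fields[:n]]
    return PrintfRecord(int(id_field), sizes, fields[n])
```

For instance, a record with two arguments of 4 and 8 bytes decodes to the ID,
the list ``[4, 8]``, and the untouched format string.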
2716 .. table:: AMDHSA Code Object V2 Kernel Metadata Map
2717 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table
2719 ================= ============== ========= ================================
2720 String Key Value Type Required? Description
2721 ================= ============== ========= ================================
2722 "Name" string Required Source name of the kernel.
2723 "SymbolName" string Required Name of the kernel
2724 descriptor ELF symbol.
2725 "Language" string Source language of the kernel.
2733 "LanguageVersion" sequence of - The first integer is the major
2735 - The second integer is the
2737 "Attrs" mapping Mapping of kernel attributes.
2739 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
2740 for the mapping definition.
2741 "Args" sequence of Sequence of mappings of the
2742 mapping kernel arguments. See
2743 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
2744 for the definition of the mapping.
2745 "CodeProps" mapping Mapping of properties related to
2746 the kernel code. See
2747 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
2748 for the mapping definition.
2749 ================= ============== ========= ================================
2753 .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
2754 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table
2756 =================== ============== ========= ==============================
2757 String Key Value Type Required? Description
2758 =================== ============== ========= ==============================
2759 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
2760 3 integers must be >=1 and the dispatch
2761 work-group size X, Y, Z must
2762 correspond to the specified
2763 values. Defaults to 0, 0, 0.
2765 Corresponds to the OpenCL
2766 ``reqd_work_group_size``
2768 "WorkGroupSizeHint" sequence of The dispatch work-group size
2769 3 integers X, Y, Z is likely to be the
2772 Corresponds to the OpenCL
2773 ``work_group_size_hint``
2775 "VecTypeHint" string The name of a scalar or vector
2778 Corresponds to the OpenCL
2779 ``vec_type_hint`` attribute.
2781 "RuntimeHandle" string The external symbol name
2782 associated with a kernel.
2783 OpenCL runtime allocates a
2784 global buffer for the symbol
2785 and saves the kernel's address
2786 to it, which is used for
2787 device side enqueueing. Only
2788 available for device side
2790 =================== ============== ========= ==============================
2794 .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
2795 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table
2797 ================= ============== ========= ================================
2798 String Key Value Type Required? Description
2799 ================= ============== ========= ================================
2800 "Name" string Kernel argument name.
2801 "TypeName" string Kernel argument type name.
2802 "Size" integer Required Kernel argument size in bytes.
2803 "Align" integer Required Kernel argument alignment in
2804 bytes. Must be a power of two.
2805 "ValueKind" string Required Kernel argument kind that
2806 specifies how to set up the
2807 corresponding argument.
2811 The argument is copied
2812 directly into the kernarg.
2815 A global address space pointer
2816 to the buffer data is passed
2819 "DynamicSharedPointer"
2820 A group address space pointer
2821 to dynamically allocated LDS
2822 is passed in the kernarg.
2825 A global address space
2826 pointer to a S# is passed in
2830 A global address space
2831 pointer to a T# is passed in
2835 A global address space pointer
2836 to an OpenCL pipe is passed in
2840 A global address space pointer
2841 to an OpenCL device enqueue
2842 queue is passed in the
2845 "HiddenGlobalOffsetX"
2846 The OpenCL grid dispatch
2847 global offset for the X
2848 dimension is passed in the
2851 "HiddenGlobalOffsetY"
2852 The OpenCL grid dispatch
2853 global offset for the Y
2854 dimension is passed in the
2857 "HiddenGlobalOffsetZ"
2858 The OpenCL grid dispatch
2859 global offset for the Z
2860 dimension is passed in the
2864 An argument that is not used
2865 by the kernel. Space needs to
2866 be left for it, but it does
2867 not need to be set up.
2869 "HiddenPrintfBuffer"
2870 A global address space pointer
2871 to the runtime printf buffer
2872 is passed in kernarg. Mutually
2874 "HiddenHostcallBuffer".
2876 "HiddenHostcallBuffer"
2877 A global address space pointer
2878 to the runtime hostcall buffer
2879 is passed in kernarg. Mutually
2881 "HiddenPrintfBuffer".
2883 "HiddenDefaultQueue"
2884 A global address space pointer
2885 to the OpenCL device enqueue
2886 queue that should be used by
2887 the kernel by default is
2888 passed in the kernarg.
2890 "HiddenCompletionAction"
2891 A global address space pointer
2892 to help link enqueued kernels into
2893 the ancestor tree for determining
2894 when the parent kernel has finished.
2896 "HiddenMultiGridSyncArg"
2897 A global address space pointer for
2898 multi-grid synchronization is
2899 passed in the kernarg.
2901 "ValueType" string Unused and deprecated. This should no longer
2902 be emitted, but is accepted for compatibility.
2905 "PointeeAlign" integer Alignment in bytes of pointee
2906 type for pointer type kernel
2907 argument. Must be a power
2908 of 2. Only present if
2910 "DynamicSharedPointer".
2911 "AddrSpaceQual" string Kernel argument address space
2912 qualifier. Only present if
2913 "ValueKind" is "GlobalBuffer" or
2914 "DynamicSharedPointer". Values
2926 Is GlobalBuffer only Global
2928 DynamicSharedPointer always
2929 Local? Can HCC allow Generic?
2930 How can Private or Region
2933 "AccQual" string Kernel argument access
2934 qualifier. Only present if
2935 "ValueKind" is "Image" or
2948 "ActualAccQual" string The actual memory accesses
2949 performed by the kernel on the
2950 kernel argument. Only present if
2951 "ValueKind" is "GlobalBuffer",
2952 "Image", or "Pipe". This may be
2953 more restrictive than indicated
2954 by "AccQual" to reflect what the
2955 kernel actually does. If not
2956 present then the runtime must
2957 assume what is implied by
2958 "AccQual" and "IsConst". Values
2965 "IsConst" boolean Indicates if the kernel argument
2966 is const qualified. Only present
2970 "IsRestrict" boolean Indicates if the kernel argument
2971 is restrict qualified. Only
2972 present if "ValueKind" is
2975 "IsVolatile" boolean Indicates if the kernel argument
2976 is volatile qualified. Only
2977 present if "ValueKind" is
2980 "IsPipe" boolean Indicates if the kernel argument
2981 is pipe qualified. Only present
2982 if "ValueKind" is "Pipe".
2986 Can GlobalBuffer be pipe
2989 ================= ============== ========= ================================
2993 .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
2994 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table
2996 ============================ ============== ========= =====================
2997 String Key Value Type Required? Description
2998 ============================ ============== ========= =====================
2999 "KernargSegmentSize" integer Required The size in bytes of
3001 that holds the values
3004 "GroupSegmentFixedSize" integer Required The amount of group
3008 bytes. This does not
3010 dynamically allocated
3011 group segment memory
3015 "PrivateSegmentFixedSize" integer Required The amount of fixed
3016 private address space
3017 memory required for a
3019 bytes. If the kernel
3021 stack then additional
3023 to this value for the
3025 "KernargSegmentAlign" integer Required The maximum byte
3028 kernarg segment. Must
3030 "WavefrontSize" integer Required Wavefront size. Must
3032 "NumSGPRs" integer Required Number of scalar
3036 includes the special
3038 Scratch (GFX7-GFX10)
3040 GFX8-GFX10). It does
3042 SGPR added if a trap
3048 "NumVGPRs" integer Required Number of vector
3052 "MaxFlatWorkGroupSize" integer Required Maximum flat
3055 kernel in work-items.
3058 ReqdWorkGroupSize if
3060 "NumSpilledSGPRs" integer Number of stores from
3061 a scalar register to
3062 a register allocator
3065 "NumSpilledVGPRs" integer Number of stores from
3066 a vector register to
3067 a register allocator
3070 ============================ ============== ========= =====================
3072 .. _amdgpu-amdhsa-code-object-metadata-v3:
3074 Code Object V3 Metadata
3075 +++++++++++++++++++++++
3078 Code object V3 is not the default code object version emitted by this version
3081 Code object V3 and above metadata is specified by the ``NT_AMDGPU_METADATA`` note
3082 record (see :ref:`amdgpu-note-records-v3-onwards`).
3084 The metadata is represented as Message Pack formatted binary data (see
3085 [MsgPack]_). The top level is a Message Pack map that includes the
3086 keys defined in table
3087 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
3090 Additional information can be added to the maps. To avoid conflicts,
3091 any key names should be prefixed by "*vendor-name*." where
3092 ``vendor-name`` can be the name of the vendor and specific vendor
3093 tool that generates the information. The prefix is abbreviated to
3094 simply "." when it appears within a map that has been added by the
3097 .. table:: AMDHSA Code Object V3 Metadata Map
3098 :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
3100 ================= ============== ========= =======================================
3101 String Key Value Type Required? Description
3102 ================= ============== ========= =======================================
3103 "amdhsa.version" sequence of Required - The first integer is the major
3104 2 integers version. Currently 1.
3105 - The second integer is the minor
3106 version. Currently 0.
3107 "amdhsa.printf" sequence of Each string is encoded information
3108 strings about a printf function call. The
3109 encoded information is organized as
3110 fields separated by colon (':'):
3112 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3117 A 32-bit integer as a unique id for
3118 each printf function call
3121 A 32-bit integer equal to the number
3122 of arguments of printf function call
3125 ``S[i]`` (where i = 0, 1, ... , N-1)
3126 32-bit integers for the size in bytes
3127 of the i-th FormatString argument of
3128 the printf function call
3131 The format string passed to the
3132 printf function call.
3133 "amdhsa.kernels" sequence of Required Sequence of the maps for each
3134 map kernel in the code object. See
3135 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
3136 for the definition of the keys included
3138 ================= ============== ========= =======================================
3142 .. table:: AMDHSA Code Object V3 Kernel Metadata Map
3143 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
3145 =================================== ============== ========= ================================
3146 String Key Value Type Required? Description
3147 =================================== ============== ========= ================================
3148 ".name" string Required Source name of the kernel.
3149 ".symbol" string Required Name of the kernel
3150 descriptor ELF symbol.
3151 ".language" string Source language of the kernel.
3161 ".language_version" sequence of - The first integer is the major
3163 - The second integer is the
3165 ".args" sequence of Sequence of maps of the
3166 map kernel arguments. See
3167 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
3168 for the definition of the keys
3169 included in that map.
3170 ".reqd_workgroup_size" sequence of If not 0, 0, 0 then all values
3171 3 integers must be >=1 and the dispatch
3172 work-group size X, Y, Z must
3173 correspond to the specified
3174 values. Defaults to 0, 0, 0.
3176 Corresponds to the OpenCL
3177 ``reqd_work_group_size``
3179 ".workgroup_size_hint" sequence of The dispatch work-group size
3180 3 integers X, Y, Z is likely to be the
3183 Corresponds to the OpenCL
3184 ``work_group_size_hint``
3186 ".vec_type_hint" string The name of a scalar or vector
3189 Corresponds to the OpenCL
3190 ``vec_type_hint`` attribute.
3192 ".device_enqueue_symbol" string The external symbol name
3193 associated with a kernel.
3194 OpenCL runtime allocates a
3195 global buffer for the symbol
3196 and saves the kernel's address
3197 to it, which is used for
3198 device side enqueueing. Only
3199 available for device side
3201 ".kernarg_segment_size" integer Required The size in bytes of
3203 that holds the values
3206 ".group_segment_fixed_size" integer Required The amount of group
3210 bytes. This does not
3212 dynamically allocated
3213 group segment memory
3217 ".private_segment_fixed_size" integer Required The amount of fixed
3218 private address space
3219 memory required for a
3221 bytes. If the kernel
3223 stack then additional
3225 to this value for the
3227 ".kernarg_segment_align" integer Required The maximum byte
3230 kernarg segment. Must
3232 ".wavefront_size" integer Required Wavefront size. Must
3234 ".sgpr_count" integer Required Number of scalar
3235 registers required by a
3237 GFX6-GFX9. A register
3238 is required if it is
3240 if a higher numbered
3243 includes the special
3249 SGPR added if a trap
3255 ".vgpr_count" integer Required Number of vector
3256 registers required by
3258 GFX6-GFX9. A register
3259 is required if it is
3261 if a higher numbered
3264 ".agpr_count" integer Required Number of accumulator
3265 registers required by
3268 ".max_flat_workgroup_size" integer Required Maximum flat
3271 kernel in work-items.
3274 ReqdWorkGroupSize if
3276 ".sgpr_spill_count" integer Number of stores from
3277 a scalar register to
3278 a register allocator
3281 ".vgpr_spill_count" integer Number of stores from
3282 a vector register to
3283 a register allocator
3286 ".kind" string The kind of the kernel
3294 These kernels must be
3295 invoked after loading
3305 These kernels must be
3308 containing code object
3309 and after all init and
3310 normal kernels in the
3311 same code object have
3315 If omitted, "normal" is
3317 =================================== ============== ========= ================================
3321 .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
3322 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
3324 ====================== ============== ========= ================================
3325 String Key Value Type Required? Description
3326 ====================== ============== ========= ================================
3327 ".name" string Kernel argument name.
3328 ".type_name" string Kernel argument type name.
3329 ".size" integer Required Kernel argument size in bytes.
3330 ".offset" integer Required Kernel argument offset in
3331 bytes. The offset must be a
3332 multiple of the alignment
3333 required by the argument.
3334 ".value_kind" string Required Kernel argument kind that
3335 specifies how to set up the
3336 corresponding argument.
3340 The argument is copied
3341 directly into the kernarg.
3344 A global address space pointer
3345 to the buffer data is passed
3348 "dynamic_shared_pointer"
3349 A group address space pointer
3350 to dynamically allocated LDS
3351 is passed in the kernarg.
3354 A global address space
3355 pointer to a S# is passed in
3359 A global address space
3360 pointer to a T# is passed in
3364 A global address space pointer
3365 to an OpenCL pipe is passed in
3369 A global address space pointer
3370 to an OpenCL device enqueue
3371 queue is passed in the
3374 "hidden_global_offset_x"
3375 The OpenCL grid dispatch
3376 global offset for the X
3377 dimension is passed in the
3380 "hidden_global_offset_y"
3381 The OpenCL grid dispatch
3382 global offset for the Y
3383 dimension is passed in the
3386 "hidden_global_offset_z"
3387 The OpenCL grid dispatch
3388 global offset for the Z
3389 dimension is passed in the
3393 An argument that is not used
3394 by the kernel. Space needs to
3395 be left for it, but it does
3396 not need to be set up.
3398 "hidden_printf_buffer"
3399 A global address space pointer
3400 to the runtime printf buffer
3401 is passed in kernarg. Mutually
3403 "hidden_hostcall_buffer"
3404 before Code Object V5.
3406 "hidden_hostcall_buffer"
3407 A global address space pointer
3408 to the runtime hostcall buffer
3409 is passed in kernarg. Mutually
3411 "hidden_printf_buffer"
3412 before Code Object V5.
3414 "hidden_default_queue"
3415 A global address space pointer
3416 to the OpenCL device enqueue
3417 queue that should be used by
3418 the kernel by default is
3419 passed in the kernarg.
3421 "hidden_completion_action"
3422 A global address space pointer
3423 to help link enqueued kernels into
3424 the ancestor tree for determining
3425 when the parent kernel has finished.
3427 "hidden_multigrid_sync_arg"
3428 A global address space pointer for
3429 multi-grid synchronization is
3430 passed in the kernarg.
3432 ".value_type" string Unused and deprecated. This should no longer
3433 be emitted, but is accepted for compatibility.
3435 ".pointee_align" integer Alignment in bytes of pointee
3436 type for pointer type kernel
3437 argument. Must be a power
3438 of 2. Only present if
3440 "dynamic_shared_pointer".
3441 ".address_space" string Kernel argument address space
3442 qualifier. Only present if
3443 ".value_kind" is "global_buffer" or
3444 "dynamic_shared_pointer". Values
3456 Is "global_buffer" only "global"
3458 "dynamic_shared_pointer" always
3459 "local"? Can HCC allow "generic"?
3460 How can "private" or "region"
3463 ".access" string Kernel argument access
3464 qualifier. Only present if
3465 ".value_kind" is "image" or
3478 ".actual_access" string The actual memory accesses
3479 performed by the kernel on the
3480 kernel argument. Only present if
3481 ".value_kind" is "global_buffer",
3482 "image", or "pipe". This may be
3483 more restrictive than indicated
3484 by ".access" to reflect what the
3485 kernel actually does. If not
3486 present then the runtime must
3487 assume what is implied by
3488 ".access" and ".is_const". Values
3495 ".is_const" boolean Indicates if the kernel argument
3496 is const qualified. Only present
3500 ".is_restrict" boolean Indicates if the kernel argument
3501 is restrict qualified. Only
3502 present if ".value_kind" is
3505 ".is_volatile" boolean Indicates if the kernel argument
3506 is volatile qualified. Only
3507 present if ".value_kind" is
3510 ".is_pipe" boolean Indicates if the kernel argument
3511 is pipe qualified. Only present
3512 if ".value_kind" is "pipe".
3516 Can "global_buffer" be pipe
3519 ====================== ============== ========= ================================
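The ``.offset`` rule above (each offset must be a multiple of the alignment
required by the argument) can be illustrated with a small layout helper. This
is only a sketch under stated assumptions: the function name
``layout_kernargs`` is hypothetical, arguments are assumed to be packed in
declaration order with power-of-two alignments, and the compiler remains the
authority on the actual kernarg layout.

```python
def layout_kernargs(args):
    """Assign offsets to (name, size, align) tuples in declaration order.

    Returns the per-argument maps and the total size in bytes.
    """
    offset = 0
    layout = []
    for name, size, align in args:
        if align <= 0 or align & (align - 1):
            raise ValueError(f"alignment of {name} must be a power of two")
        # Round the running offset up to the next multiple of the alignment.
        offset = (offset + align - 1) & ~(align - 1)
        layout.append({".name": name, ".size": size, ".offset": offset})
        offset += size
    return layout, offset
```

With a 4-byte argument followed by an 8-byte pointer, the pointer is placed at
offset 8 rather than 4, leaving 4 bytes of padding.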
3521 .. _amdgpu-amdhsa-code-object-metadata-v4:
3523 Code Object V4 Metadata
3524 +++++++++++++++++++++++
3526 Code object V4 metadata is the same as
3527 :ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
3528 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.
3530 .. table:: AMDHSA Code Object V4 Metadata Map Changes
3531 :name: amdgpu-amdhsa-code-object-metadata-map-table-v4
3533 ================= ============== ========= =======================================
3534 String Key Value Type Required? Description
3535 ================= ============== ========= =======================================
3536 "amdhsa.version" sequence of Required - The first integer is the major
3537 2 integers version. Currently 1.
3538 - The second integer is the minor
3539 version. Currently 1.
3540 "amdhsa.target" string Required The target name of the code using the syntax:
3544 <target-triple> [ "-" <target-id> ]
3546 A canonical target ID must be
3547 used. See :ref:`amdgpu-target-triples`
3548 and :ref:`amdgpu-target-id`.
3549 ================= ============== ========= =======================================
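A consumer of ``amdhsa.target`` typically needs to separate the target triple
from the optional target ID. The sketch below is illustrative only: the
function name ``parse_amdhsa_target`` is hypothetical, and it assumes a
four-component triple (architecture, vendor, OS, and a possibly empty
environment) followed by a target ID whose feature settings each end in ``+``
or ``-``. See :ref:`amdgpu-target-id` for the authoritative syntax.

```python
def parse_amdhsa_target(target: str):
    """Split '<target-triple> [ "-" <target-id> ]' into its parts."""
    # The triple is assumed to have four '-'-separated components. Limiting
    # the split to four cuts keeps a trailing '-' of a feature setting intact.
    parts = target.split("-", 4)
    if len(parts) < 4:
        raise ValueError(f"incomplete target triple in {target!r}")
    triple = "-".join(parts[:4])
    target_id = parts[4] if len(parts) == 5 else None

    processor, features = None, {}
    if target_id:
        processor, *settings = target_id.split(":")
        for setting in settings:
            if setting[-1:] not in ("+", "-"):
                raise ValueError(f"malformed feature setting {setting!r}")
            # '+' means the feature is on, '-' means it is off.
            features[setting[:-1]] = setting.endswith("+")
    return triple, processor, features
```

The example string used in the accompanying checks,
``amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-``, is an assumed input chosen for
illustration.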
3551 .. _amdgpu-amdhsa-code-object-metadata-v5:
3553 Code Object V5 Metadata
3554 +++++++++++++++++++++++
3557 Code object V5 is not the default code object version emitted by this version
3561 Code object V5 metadata is the same as
3562 :ref:`amdgpu-amdhsa-code-object-metadata-v4` with the changes defined in table
3563 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v5`, table
3564 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5` and table
3565 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5`.
3567 .. table:: AMDHSA Code Object V5 Metadata Map Changes
3568 :name: amdgpu-amdhsa-code-object-metadata-map-table-v5
3570 ================= ============== ========= =======================================
3571 String Key Value Type Required? Description
3572 ================= ============== ========= =======================================
3573 "amdhsa.version" sequence of Required - The first integer is the major
3574 2 integers version. Currently 1.
3575 - The second integer is the minor
3576 version. Currently 2.
3577 ================= ============== ========= =======================================
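The ``amdhsa.version`` values listed for code object V3, V4, and V5 (1.0, 1.1,
and 1.2 respectively) allow a tool to infer the code object version from the
metadata. The following sketch assumes the Message Pack data has already been
decoded into a Python ``dict``; the names ``METADATA_VERSION_TO_CODE_OBJECT``
and ``code_object_version`` are hypothetical.

```python
# Metadata (major, minor) version -> code object version, taken from the
# V3, V4, and V5 metadata map tables.
METADATA_VERSION_TO_CODE_OBJECT = {(1, 0): 3, (1, 1): 4, (1, 2): 5}


def code_object_version(metadata: dict) -> int:
    """Return the code object version implied by 'amdhsa.version'."""
    version = tuple(metadata["amdhsa.version"])
    try:
        return METADATA_VERSION_TO_CODE_OBJECT[version]
    except KeyError:
        raise ValueError(f"unrecognized amdhsa.version {version}") from None
```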
3581 .. table:: AMDHSA Code Object V5 Kernel Metadata Map Additions
3582 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v5
3584 ============================= ============= ========== =======================================
3585 String Key Value Type Required? Description
3586 ============================= ============= ========== =======================================
3587 ".uses_dynamic_stack" boolean Indicates if the generated machine code
3588 is using a dynamically sized stack.
3589 ".workgroup_processor_mode" boolean (GFX10+) Controls ENABLE_WGP_MODE in
3590 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
3591 ============================= ============= ========== =======================================
3595 .. table:: AMDHSA Code Object V5 Kernel Attribute Metadata Map
3596 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v5-table
3598 =========================== ============== ========= ==============================
3599 String Key Value Type Required? Description
3600 =========================== ============== ========= ==============================
3601 ".uniform_work_group_size" integer Indicates if the kernel
3602 requires that each dimension
3603 of global size is a multiple
3604 of corresponding dimension of
3605 work-group size. Value of 1
3606 implies true and value of 0
3607 implies false. Metadata is
3608 only emitted when value is 1.
3609 =========================== ============== ========= ==============================
3615 .. table:: AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes
3616 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5
3618 ====================== ============== ========= ================================
3619 String Key Value Type Required? Description
3620 ====================== ============== ========= ================================
3621 ".value_kind" string Required Kernel argument kind that
3622 specifies how to set up the
3623 corresponding argument.
3625 the same as code object V3 metadata
3626 (see :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`)
3627 with the following additions:
3629 "hidden_block_count_x"
3630 The grid dispatch work-group count for the X dimension
3631 is passed in the kernarg. Some languages, such as OpenCL,
3632 support a last work-group in each dimension being partial.
3633 This count only includes the non-partial work-group count.
3634 This is not the same as the value in the AQL dispatch packet,
3635 which has the grid size in work-items.
3637 "hidden_block_count_y"
3638 The grid dispatch work-group count for the Y dimension
3639 is passed in the kernarg. Some languages, such as OpenCL,
3640 support a last work-group in each dimension being partial.
3641 This count only includes the non-partial work-group count.
3642 This is not the same as the value in the AQL dispatch packet,
3643 which has the grid size in work-items. If the grid dimensionality
3644 is 1, then it must be 1.
3646 "hidden_block_count_z"
3647 The grid dispatch work-group count for the Z dimension
3648 is passed in the kernarg. Some languages, such as OpenCL,
3649 support a last work-group in each dimension being partial.
3650 This count only includes the non-partial work-group count.
3651 This is not the same as the value in the AQL dispatch packet,
3652 which has the grid size in work-items. If the grid dimensionality
3653 is 1 or 2, then it must be 1.
3655 "hidden_group_size_x"
3656 The grid dispatch work-group size for the X dimension is
3657 passed in the kernarg. This size only applies to the
3658 non-partial work-groups. This is the same value as the AQL
3659 dispatch packet work-group size.
3661 "hidden_group_size_y"
3662 The grid dispatch work-group size for the Y dimension is
3663 passed in the kernarg. This size only applies to the
3664 non-partial work-groups. This is the same value as the AQL
3665 dispatch packet work-group size. If the grid dimensionality
3666 is 1, then it must be 1.
3668 "hidden_group_size_z"
3669 The grid dispatch work-group size for the Z dimension is
3670 passed in the kernarg. This size only applies to the
3671 non-partial work-groups. This is the same value as the AQL
3672 dispatch packet work-group size. If the grid dimensionality
3673 is 1 or 2, then it must be 1.
3675 "hidden_remainder_x"
3676 The grid dispatch work-group size of the partial work-group
3677 of the X dimension, if it exists. Must be zero if a partial
3678 work-group does not exist in the X dimension.
3680 "hidden_remainder_y"
3681 The grid dispatch work-group size of the partial work-group
3682 of the Y dimension, if it exists. Must be zero if a partial
3683 work-group does not exist in the Y dimension.
3685 "hidden_remainder_z"
3686 The grid dispatch work-group size of the partial work-group
3687 of the Z dimension, if it exists. Must be zero if a partial
3688 work-group does not exist in the Z dimension.
3691 The grid dispatch dimensionality. This is the same value
3692 as the AQL dispatch packet dimensionality. Must be a value
3696 A global address space pointer to an initialized memory
3697 buffer that conforms to the requirements of the malloc/free
3698 device library V1 version implementation.
3700 "hidden_private_base"
3701 The high 32 bits of the flat addressing private aperture base.
3702 Only used by GFX8 to allow conversion between private segment
3703 and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3705 "hidden_shared_base"
3706 The high 32 bits of the flat addressing shared aperture base.
3707 Only used by GFX8 to allow conversion between shared segment
3708 and flat addresses. See :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
3711 A global memory address space pointer to the ROCm runtime
3712 ``struct amd_queue_t`` structure for the HSA queue of the
3713 associated dispatch AQL packet. It is only required for pre-GFX9
3714 devices for the trap handler ABI (see :ref:`amdgpu-amdhsa-trap-handler-abi`).
3716 ====================== ============== ========= ================================
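The relationship between the ``hidden_block_count_*``, ``hidden_group_size_*``,
and ``hidden_remainder_*`` arguments can be shown with a short calculation.
This is a sketch of the arithmetic implied by the descriptions above, not the
runtime's code: the function name ``hidden_dispatch_args`` is hypothetical, the
grid size is assumed to be given in work-items as in the AQL dispatch packet,
and unused dimensions are assumed to have a grid size and work-group size of 1.

```python
def hidden_dispatch_args(grid_size, workgroup_size):
    """Derive per-dimension hidden argument values.

    grid_size is in work-items; both parameters are (x, y, z) tuples.
    """
    args = {}
    for axis, grid, group in zip("xyz", grid_size, workgroup_size):
        # Only complete (non-partial) work-groups are counted.
        args[f"hidden_block_count_{axis}"] = grid // group
        # Same value as the work-group size in the dispatch packet.
        args[f"hidden_group_size_{axis}"] = group
        # Size of the trailing partial work-group, or 0 if there is none.
        args[f"hidden_remainder_{axis}"] = grid % group
    return args
```

For a one-dimensional dispatch of 1000 work-items with a work-group size of
256, this yields three complete work-groups in X and a partial work-group of
232 work-items, while the Y and Z counts and sizes are all 1.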
3723 The HSA architected queuing language (AQL) defines a user space memory interface
3724 that can be used to control the dispatch of kernels, in an agent independent
3725 way. An agent can have zero or more AQL queues created for it using an HSA
3726 compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
3727 are 64 bytes) can be placed. See the *HSA Platform System Architecture
3728 Specification* [HSA]_ for the AQL queue mechanics and packet layouts.
3730 The packet processor of a kernel agent is responsible for detecting and
3731 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
3732 packet processor is implemented by the hardware command processor (CP),
3733 asynchronous dispatch controller (ADC) and shader processor input controller
3736 An HSA compatible runtime can be used to allocate an AQL queue object. It uses
3737 the kernel mode driver to initialize and register the AQL queue with CP.
3739 To dispatch a kernel the following actions are performed. This can occur in the
3740 CPU host program, or from an HSA kernel executing on a GPU.
3742 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
3743 executed is obtained.
3744 2. A pointer to the kernel descriptor (see
3745 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
3746 It must be for a kernel that is contained in a code object that was loaded
3747 by an HSA compatible runtime on the kernel agent with which the AQL queue is
3749 3. Space is allocated for the kernel arguments using the HSA compatible runtime
3750 allocator for a memory region with the kernarg property for the kernel agent
3751 that will execute the kernel. It must be at least 16-byte aligned.
3752 4. Kernel argument values are assigned to the kernel argument memory
3753 allocation. The layout is defined in the *HSA Programmer's Language
3754 Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
3755 kernel argument memory in the same way constant memory is accessed. (Note
3756 that the HSA specification allows an implementation to copy the kernel
3757 argument contents to another location that is accessed by the kernel.)
3758 5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
3759 runtime API uses 64-bit atomic operations to reserve space in the AQL queue
3760 for the packet. The packet must be set up, and the final write must use an
3761 atomic store release to set the packet kind to ensure the packet contents are
3762 visible to the kernel agent. AQL defines a doorbell signal mechanism to
3763 notify the kernel agent that the AQL queue has been updated. These rules, and
3764 the layout of the AQL queue and kernel dispatch packet, are defined in the *HSA
3765 System Architecture Specification* [HSA]_.
3766 6. A kernel dispatch packet includes information about the actual dispatch,
3767 such as grid and work-group size, together with information from the code
3768 object about the kernel, such as segment sizes. The HSA compatible runtime
3769 queries on the kernel symbol can be used to obtain the code object values
3770 which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
3771 7. CP executes micro-code and is responsible for detecting and setting up the
3772 GPU to execute the wavefronts of a kernel dispatch.
3773 8. CP ensures that when a wavefront starts executing the kernel machine
3774 code, the scalar general purpose registers (SGPR) and vector general purpose
3775 registers (VGPR) are set up as required by the machine code. The required
3776 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
3777 register state is defined in
3778 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
3779 9. The prolog of the kernel machine code (see
3780 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
3781 before continuing executing the machine code that corresponds to the kernel.
3782 10. When the kernel dispatch has completed execution, CP signals the completion
3783 signal specified in the kernel dispatch packet if not 0.
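Steps 3 to 5 can be modelled on the host with a few lines of Python. This is only a sketch of the ordering rule (fill in the packet body first, write the packet kind last), not runtime code. The field order and the packet type values are assumptions based on the HSA kernel dispatch packet layout and should be checked against [HSA]_. The names ``build_dispatch_packet`` and ``publish`` are hypothetical, and a plain ``bytearray`` stands in for the AQL queue.

```python
import struct

AQL_PACKET_SIZE = 64             # stated above: every AQL packet is 64 bytes
PACKET_TYPE_INVALID = 1          # assumed HSA packet type encodings
PACKET_TYPE_KERNEL_DISPATCH = 2

# Assumed layout: header, setup, workgroup_size_{x,y,z}, reserved (16-bit each),
# grid_size_{x,y,z}, private_segment_size, group_segment_size (32-bit each),
# kernel_object, kernarg_address, reserved, completion_signal (64-bit each).
_PACKET = struct.Struct("<6H5I4Q")
assert _PACKET.size == AQL_PACKET_SIZE


def build_dispatch_packet(kernel_object, kernarg_address, grid, workgroup,
                          dims, private_size=0, group_size=0,
                          completion_signal=0):
    """Return a 64-byte packet whose header is still marked INVALID."""
    if kernarg_address % 16:
        raise ValueError("kernarg memory must be at least 16-byte aligned")
    gx, gy, gz = grid
    wx, wy, wz = workgroup
    return bytearray(_PACKET.pack(
        PACKET_TYPE_INVALID, dims, wx, wy, wz, 0,
        gx, gy, gz, private_size, group_size,
        kernel_object, kernarg_address, 0, completion_signal))


def publish(queue, write_index, packet):
    """Copy the packet body, then set the packet kind as the final write."""
    slots = len(queue) // AQL_PACKET_SIZE
    offset = (write_index % slots) * AQL_PACKET_SIZE
    queue[offset + 2:offset + AQL_PACKET_SIZE] = packet[2:]
    # A real runtime performs this last write as an atomic store release
    # and then rings the doorbell signal; neither is modelled here.
    struct.pack_into("<H", queue, offset, PACKET_TYPE_KERNEL_DISPATCH)
    return offset
```

The other header bits (barrier and fence scopes) are left at zero in this sketch; a real dispatch would need to set them as required by the memory model.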
3785 .. _amdgpu-amdhsa-memory-spaces:
3790 The memory space properties are:
3792 .. table:: AMDHSA Memory Spaces
3793 :name: amdgpu-amdhsa-memory-spaces-table
3795 ================= =========== ======== ======= ==================
3796 Memory Space Name HSA Segment Hardware Address NULL Value
3798 ================= =========== ======== ======= ==================
3799 Private private scratch 32 0x00000000
3800 Local group LDS 32 0xFFFFFFFF
3801 Global global global 64 0x0000000000000000
3802 Constant constant *same as 64 0x0000000000000000
3804 Generic flat flat 64 0x0000000000000000
3805 Region N/A GDS 32 *not implemented
3807 ================= =========== ======== ======= ==================
3809 The global and constant memory spaces both use global virtual addresses, which
3810 are the same virtual address space used by the CPU. However, some virtual
3811 addresses may only be accessible to the CPU, some only accessible by the GPU,
3814 Using the constant memory space indicates that the data will not change during
3815 the execution of the kernel. This allows scalar read instructions to be
3816 used. The vector and scalar L1 caches are invalidated of volatile data before
3817 each kernel dispatch execution to allow constant memory to change values between
3820 The local memory space uses the hardware Local Data Store (LDS) which is
3821 automatically allocated when the hardware creates work-groups of wavefronts, and
3822 freed when all the wavefronts of a work-group have terminated. The data store
3823 (DS) instructions can be used to access it.
3825 The private memory space uses the hardware scratch memory support. If the kernel
3826 uses scratch, then the hardware allocates memory that is accessed using
3827 wavefront lane dword (4 byte) interleaving. The mapping used from private
3828 address to physical address is:
3830 ``wavefront-scratch-base +
3831 (private-address * wavefront-size * 4) +
3832 (wavefront-lane-id * 4)``
3834 There are different ways that the wavefront scratch base address is determined
3835 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
3836 memory can be accessed in an interleaved manner using buffer instructions with
3837 the scratch buffer descriptor and per wavefront scratch offset, by the scratch
3838 instructions, or by flat instructions. If each lane of a wavefront accesses the
3839 same private address, the interleaving results in adjacent dwords being accessed
3840 and hence requires fewer cache lines to be fetched. Multi-dword access is not
3841 supported except by flat and scratch instructions in GFX9-GFX11.
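The effect of the interleaving can be checked numerically. The sketch below applies the mapping exactly as written above; the function name and the base address are made up for illustration, and the wavefront size of 64 is only a default.

```python
def scratch_physical_address(wavefront_scratch_base, private_address,
                             wavefront_lane_id, wavefront_size=64):
    """Apply the private-to-physical mapping literally as given in the text."""
    return (wavefront_scratch_base
            + private_address * wavefront_size * 4
            + wavefront_lane_id * 4)


# Four lanes reading the same private address touch four adjacent dwords.
lane_addresses = [scratch_physical_address(0x10000, 3, lane)
                  for lane in range(4)]
```

Because consecutive lanes are 4 bytes apart, a whole wavefront accessing one private address covers a single contiguous run of ``wavefront_size * 4`` bytes.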
3843 The generic address space uses the hardware flat address support available in
3844 GFX7-GFX11. This uses two fixed ranges of virtual addresses (the private and
3845 local apertures), that are outside the range of addressable global memory, to
3846 map from a flat address to a private or local address.
3848 FLAT instructions can take a flat address and access global, private (scratch)
3849 and group (LDS) memory depending on whether the address is within one of the
3850 aperture ranges. Flat access to scratch requires hardware aperture setup and
3851 setup in the kernel prologue (see
3852 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
3853 hardware aperture setup and M0 (GFX7-GFX8) register setup (see
3854 :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
3856 To convert between a segment address and a flat address the base addresses of the
3857 apertures can be used. For GFX7-GFX8 these are available in the
3858 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
3859 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
3860 GFX9-GFX11 the aperture base addresses are directly available as inline constant
3861 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit
3862 address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
3863 which makes it easier to convert from flat to segment or segment to flat.
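With a 2^32 byte aperture whose base is 2^32 aligned, the conversion reduces to bit operations on the upper and lower halves of the 64-bit address. The following is a minimal sketch under that assumption. The function names are hypothetical, and it does not handle NULL values, which differ between memory spaces (see the table above).

```python
APERTURE_SIZE = 1 << 32  # aperture size in 64-bit address mode


def segment_to_flat(aperture_base, segment_address):
    """Place a 32-bit segment address inside the given aperture."""
    if aperture_base % APERTURE_SIZE:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    return aperture_base | segment_address


def in_aperture(aperture_base, flat_address):
    """A flat address is in the aperture when the high 32 bits match."""
    return (flat_address >> 32) == (aperture_base >> 32)


def flat_to_segment(flat_address):
    """The segment address is the low 32 bits of the flat address."""
    return flat_address & (APERTURE_SIZE - 1)
```

A caller would first test ``in_aperture`` against the private and the shared base to decide which segment, if any, a flat address belongs to, and treat everything else as a global address.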
3868 Image and sample handles created by an HSA compatible runtime (see
3869 :ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
3870 object respectively. In order to support the HSA ``query_sampler`` operations
3871 two extra dwords are used to store the HSA BRIG enumeration values for the
3872 queries that are not trivially deducible from the S# representation.
3877 HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
3878 are 64-bit addresses of a structure allocated in memory accessible from both the
3879 CPU and GPU. The structure is defined by the runtime and subject to change
3880 between releases. For example, see [AMD-ROCm-github]_.
3882 .. _amdgpu-amdhsa-hsa-aql-queue:
3887 The HSA AQL queue structure is defined by an HSA compatible runtime (see
3888 :ref:`amdgpu-os`) and subject to change between releases. For example, see
3889 [AMD-ROCm-github]_. For some processors it contains fields needed to implement
3890 certain language features such as the flat address aperture bases. It also
3891 contains fields used by CP such as managing the allocation of scratch memory.
3893 .. _amdgpu-amdhsa-kernel-descriptor:
3898 A kernel descriptor consists of the information needed by CP to initiate the
3899 execution of a kernel, including the entry point address of the machine code
3900 that implements the kernel.
3902 Code Object V3 Kernel Descriptor
3903 ++++++++++++++++++++++++++++++++
3905 CP microcode requires the Kernel descriptor to be allocated on 64-byte
3908 The fields used by CP for code objects before V3 also match those specified in
3909 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
3911 .. table:: Code Object V3 Kernel Descriptor
3912 :name: amdgpu-amdhsa-kernel-descriptor-v3-table
3914 ======= ======= =============================== ============================
3915 Bits Size Field Name Description
3916 ======= ======= =============================== ============================
3917 31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local
3918 address space memory
3919 required for a work-group
3920 in bytes. This does not
3921 include any dynamically
3922 allocated local address
3923 space memory that may be
3924 added when the kernel is
3926 63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed
3927 private address space
3928 memory required for a
3929 work-item in bytes. When
3930 this cannot be predicted,
3931 code object v4 and older
3932 sets this value to be
3933 higher than the minimum
3935 95:64 4 bytes KERNARG_SIZE The size of the kernarg
3936 memory pointed to by the
3937 AQL dispatch packet. The
3938 kernarg memory is used to
3939 pass arguments to the
3942 * If the kernarg pointer in
3943 the dispatch packet is NULL
3944 then there are no kernel
3946 * If the kernarg pointer in
3947 the dispatch packet is
3948 not NULL and this value
3949 is 0 then the kernarg
3952 * If the kernarg pointer in
3953 the dispatch packet is
3954 not NULL and this value
3955 is not 0 then the value
3956 specifies the kernarg
3957 memory size in bytes. It
3958 is recommended to provide
3959 a value as it may be used
3960 by CP to optimize making
3962 visible to the kernel
3965 127:96 4 bytes Reserved, must be 0.
3966 191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly
3969 descriptor to kernel's
3970 entry point instruction
3971 which must be 256 byte
3973 351:192 20 bytes Reserved, must be 0.
3975 383:352 4 bytes COMPUTE_PGM_RSRC3 GFX6-GFX9
3976 Reserved, must be 0.
3979 program settings used by
3981 ``COMPUTE_PGM_RSRC3``
3984 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
3987 program settings used by
3989 ``COMPUTE_PGM_RSRC3``
3992 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
3993 415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS)
3994 program settings used by
3996 ``COMPUTE_PGM_RSRC1``
3999 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
4000 447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS)
4001 program settings used by
4003 ``COMPUTE_PGM_RSRC2``
4006 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
4007 458:448 7 bits *See separate bits below.* Enable the setup of the
4008 SGPR user data registers
4010 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4012 The total number of SGPR
4014 requested must not exceed
4015 16 and must match the value in
4016 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
4017 Any requests beyond 16
4019 >448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT If the *Target Properties*
4021 :ref:`amdgpu-processor-table`
4022 specifies *Architected flat
4023 scratch* then not supported
4025 >449 1 bit ENABLE_SGPR_DISPATCH_PTR
4026 >450 1 bit ENABLE_SGPR_QUEUE_PTR
4027 >451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR
4028 >452 1 bit ENABLE_SGPR_DISPATCH_ID
4029 >453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT If the *Target Properties*
4031 :ref:`amdgpu-processor-table`
4032 specifies *Architected flat
4033 scratch* then not supported
4035 >454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT
4037 457:455 3 bits Reserved, must be 0.
4038 458 1 bit ENABLE_WAVEFRONT_SIZE32 GFX6-GFX9
4039 Reserved, must be 0.
4042 wavefront size 64 mode.
4044 native wavefront size
4046 459 1 bit USES_DYNAMIC_STACK Indicates if the generated
4047 machine code is using a
4048 dynamically sized stack.
4049 This is only set in code
4050 object v5 and later.
4051 463:460 4 bits Reserved, must be 0.
4052 464 1 bit RESERVED_464 Deprecated, must be 0.
4053 467:465 3 bits Reserved, must be 0.
4054 468 1 bit RESERVED_468 Deprecated, must be 0.
4055 471:469 3 bits Reserved, must be 0.
4056 511:472 5 bytes Reserved, must be 0.
4057 512 **Total size 64 bytes.**
4058 ======= ====================================================================
4062 .. table:: compute_pgm_rsrc1 for GFX6-GFX11
4063 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table
4065 ======= ======= =============================== ===========================================================================
4066 Bits Size Field Name Description
4067 ======= ======= =============================== ===========================================================================
4068 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register
4069 blocks used by each work-item;
4070 granularity is device
4075 - max(0, ceil(vgprs_used / 4) - 1)
4078 - vgprs_used = align(arch_vgprs, 4)
4080 - max(0, ceil(vgprs_used / 8) - 1)
4081 GFX10-GFX11 (wavefront size 64)
4083 - max(0, ceil(vgprs_used / 4) - 1)
4084 GFX10-GFX11 (wavefront size 32)
4086 - max(0, ceil(vgprs_used / 8) - 1)
4088 Where vgprs_used is defined
4089 as the highest VGPR number
4090 explicitly referenced plus
4093 Used by CP to set up
4094 ``COMPUTE_PGM_RSRC1.VGPRS``.
4097 :ref:`amdgpu-assembler`
4099 automatically for the
4100 selected processor from
4101 values provided to the
4102 `.amdhsa_kernel` directive
4104 `.amdhsa_next_free_vgpr`
4105 nested directive (see
4106 :ref:`amdhsa-kernel-directives-table`).
4107 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
4108 blocks used by a wavefront;
4109 granularity is device
4114 - max(0, ceil(sgprs_used / 8) - 1)
4117 - 2 * max(0, ceil(sgprs_used / 16) - 1)
4119 Reserved, must be 0.
4124 defined as the highest
4125 SGPR number explicitly
4126 referenced plus one, plus
4127 a target specific number
4128 of additional special
4130 FLAT_SCRATCH (GFX7+) and
4131 XNACK_MASK (GFX8+), and
4134 limitations. It does not
4135 include the 16 SGPRs added
4136 if a trap handler is
4140 limitations and special
4141 SGPR layout are defined in
4143 documentation, which can
4145 :ref:`amdgpu-processors`
4148 Used by CP to set up
4149 ``COMPUTE_PGM_RSRC1.SGPRS``.
4152 :ref:`amdgpu-assembler`
4154 automatically for the
4155 selected processor from
4156 values provided to the
4157 `.amdhsa_kernel` directive
4159 `.amdhsa_next_free_sgpr`
4160 and `.amdhsa_reserve_*`
4161 nested directives (see
4162 :ref:`amdhsa-kernel-directives-table`).
4163 11:10 2 bits PRIORITY Must be 0.
4165 Start executing wavefront
4166 at the specified priority.
4168 CP is responsible for
4170 ``COMPUTE_PGM_RSRC1.PRIORITY``.
4171 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
4172 with specified rounding
4175 precision floating point
4178 Floating point rounding
4179 mode values are defined in
4180 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4182 Used by CP to set up
4183 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4184 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
4185 with specified rounding
4186 mode for half/double (16
4187 and 64-bit) floating point
4188 precision floating point
4191 Floating point rounding
4192 mode values are defined in
4193 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4195 Used by CP to set up
4196 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4197 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
4198 with specified denorm mode
4201 precision floating point
4204 Floating point denorm mode
4205 values are defined in
4206 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4208 Used by CP to set up
4209 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4210 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
4211 with specified denorm mode
4213 and 64-bit) floating point
4214 precision floating point
4217 Floating point denorm mode
4218 values are defined in
4219 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4221 Used by CP to set up
4222 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
4223 20 1 bit PRIV Must be 0.
4225 Start executing wavefront
4226 in privilege trap handler
4229 CP is responsible for
4231 ``COMPUTE_PGM_RSRC1.PRIV``.
4232 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
4233 with DX10 clamp mode
4234 enabled. Used by the vector
4235 ALU to force DX10 style
4236 treatment of NaNs (when
4237 set, clamp NaN to zero,
4241 Used by CP to set up
4242 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
4243 22 1 bit DEBUG_MODE Must be 0.
4245 Start executing wavefront
4246 in single step mode.
4248 CP is responsible for
4250 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
4251 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
4253 enabled. Floating point
4254 opcodes that support
4255 exception flag gathering
4256 will quiet and propagate
4257 signaling-NaN inputs per
4258 IEEE 754-2008. Min_dx10 and
4259 max_dx10 become IEEE
4260 754-2008 compliant due to
4261 signaling-NaN propagation
4264 Used by CP to set up
4265 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
4266 24 1 bit BULKY Must be 0.
4268 Only one work-group allowed
4269 to execute on a compute
4272 CP is responsible for
4274 ``COMPUTE_PGM_RSRC1.BULKY``.
4275 25 1 bit CDBG_USER Must be 0.
4277 Flag that can be used to
4278 control debugging code.
4280 CP is responsible for
4282 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
4283 26 1 bit FP16_OVFL GFX6-GFX8
4284 Reserved, must be 0.
4286 Wavefront starts execution
4287 with specified fp16 overflow
4290 - If 0, fp16 overflow generates
4292 - If 1, fp16 overflow that is the
4293 result of an +/-INF input value
4294 or divide by 0 produces a +/-INF,
4295 otherwise clamps computed
4296 overflow to +/-MAX_FP16 as
4299 Used by CP to set up
4300 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
4301 28:27 2 bits Reserved, must be 0.
4302 29 1 bit WGP_MODE GFX6-GFX9
4303 Reserved, must be 0.
4305 - If 0 execute work-groups in
4306 CU wavefront execution mode.
4307 - If 1 execute work-groups
4308 in WGP wavefront execution mode.
4310 See :ref:`amdgpu-amdhsa-memory-model`.
4312 Used by CP to set up
4313 ``COMPUTE_PGM_RSRC1.WGP_MODE``.
4314 30 1 bit MEM_ORDERED GFX6-GFX9
4315 Reserved, must be 0.
4317 Controls the behavior of the
4318 s_waitcnt's vmcnt and vscnt
4321 - If 0 vmcnt reports completion
4322 of load and atomic with return
4323 out of order with sample
4324 instructions, and the vscnt
4325 reports the completion of
4326 store and atomic without
4328 - If 1 vmcnt reports completion
4329 of load, atomic with return
4330 and sample instructions in
4331 order, and the vscnt reports
4332 the completion of store and
4333 atomic without return in order.
4335 Used by CP to set up
4336 ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
4337 31 1 bit FWD_PROGRESS GFX6-GFX9
4338 Reserved, must be 0.
4340 - If 0 execute SIMD wavefronts
4341 using oldest first policy.
4342 - If 1 execute SIMD wavefronts to
4343 ensure wavefronts will make some
4346 Used by CP to set up
4347 ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
4348 32 **Total size 4 bytes**
4349 ======= ===================================================================================================================
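The granulated register counts and the bit positions in the table lend themselves to a short worked example. The sketch below implements the ``max(0, ceil(used / granule) - 1)`` pattern and packs a subset of the fields into a 32-bit word. Only the GFX10-GFX11 VGPR granules (4 for wavefront size 64, 8 for wavefront size 32) are wired in, as listed in the table; the helper names are hypothetical. In practice the :ref:`amdgpu-assembler` computes these values from the ``.amdhsa_*`` directives.

```python
def granulated_count(registers_used, granule):
    """max(0, ceil(registers_used / granule) - 1) using integer arithmetic."""
    blocks = -(-registers_used // granule)  # ceiling division
    return max(0, blocks - 1)


def gfx10_granulated_vgpr_count(vgprs_used, wavefront_size):
    """GFX10-GFX11 granule per the table: 4 for wave64, 8 for wave32."""
    granule = {64: 4, 32: 8}[wavefront_size]
    return granulated_count(vgprs_used, granule)


# (lowest bit, width) for a subset of the fields in the table above.
RSRC1_FIELDS = {
    "GRANULATED_WORKITEM_VGPR_COUNT": (0, 6),
    "GRANULATED_WAVEFRONT_SGPR_COUNT": (6, 4),
    "FLOAT_ROUND_MODE_32": (12, 2),
    "FLOAT_ROUND_MODE_16_64": (14, 2),
    "FLOAT_DENORM_MODE_32": (16, 2),
    "FLOAT_DENORM_MODE_16_64": (18, 2),
    "ENABLE_DX10_CLAMP": (21, 1),
    "ENABLE_IEEE_MODE": (23, 1),
    "FP16_OVFL": (26, 1),
    "WGP_MODE": (29, 1),
    "MEM_ORDERED": (30, 1),
    "FWD_PROGRESS": (31, 1),
}


def pack_rsrc1(**fields):
    """Pack named field values into a compute_pgm_rsrc1 word."""
    word = 0
    for name, value in fields.items():
        low, width = RSRC1_FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << low
    return word
```

Fields that the table marks as "Must be 0" (for example ``PRIORITY`` and ``PRIV``) are deliberately left out of ``RSRC1_FIELDS`` so that they cannot be set by accident.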
4353 .. table:: compute_pgm_rsrc2 for GFX6-GFX11
4354 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table
4356 ======= ======= =============================== ===========================================================================
4357 Bits Size Field Name Description
4358 ======= ======= =============================== ===========================================================================
4359 0 1 bit ENABLE_PRIVATE_SEGMENT * Enable the setup of the
4361 * If the *Target Properties*
4363 :ref:`amdgpu-processor-table`
4366 scratch* then enable the
4368 wavefront scratch offset
4369 system register (see
4370 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4371 * If the *Target Properties*
4373 :ref:`amdgpu-processor-table`
4374 specifies *Architected
4375 flat scratch* then enable
4377 FLAT_SCRATCH register
4379 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4381 Used by CP to set up
4382 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
4383 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
4385 registers requested. This
4386 number must be greater than
4387 or equal to the number of user
4388 data registers enabled.
4390 Used by CP to set up
4391 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
4392 6 1 bit ENABLE_TRAP_HANDLER Must be 0.
4395 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
4396 which is set by the CP if
4397 the runtime has installed a
4399 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
4400 system SGPR register for
4401 the work-group id in the X
4403 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4405 Used by CP to set up
4406 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
4407 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
4408 system SGPR register for
4409 the work-group id in the Y
4411 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4413 Used by CP to set up
4414 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
4415 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
4416 system SGPR register for
4417 the work-group id in the Z
4419 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4421 Used by CP to set up
4422 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
4423 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
4424 system SGPR register for
4425 work-group information (see
4426 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4428 Used by CP to set up
4429 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
4430 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
4431 VGPR system registers used
4432 for the work-item ID.
4433 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
4436 Used by CP to set up
4437 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
4438 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
4440 Wavefront starts execution
4442 exceptions enabled which
4443 are generated when L1 has
4444 witnessed a thread access
4448 CP is responsible for
4449 filling in the address
4451 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4452 according to what the
4454 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
4456 Wavefront starts execution
4457 with memory violation
4458 exceptions
4459 enabled which are generated
4460 when a memory violation has
4461 occurred for this wavefront from
4463 (write-to-read-only-memory,
4464 mis-aligned atomic, LDS
4465 address out of range,
4466 illegal address, etc.).
4470 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4471 according to what the
4473 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
4475 CP uses the rounded value
4476 from the dispatch packet,
4477 not this value, as the
4478 dispatch may contain
4479 dynamically allocated group
4480 segment memory. CP writes
4482 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
4484 Amount of group segment
4485 (LDS) to allocate for each
4486 work-group. Granularity is
4490 roundup(lds-size / (64 * 4))
4492 roundup(lds-size / (128 * 4))
4494 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
4495 _INVALID_OPERATION with specified exceptions
4498 Used by CP to set up
4499 ``COMPUTE_PGM_RSRC2.EXCP_EN``
4500 (set from bits 0..6).
4504 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
4505 _SOURCE input operands is a
4507 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
4508 _DIVISION_BY_ZERO Zero
4509 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Overflow
4511 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
4513 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
4515 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
4516 _ZERO (rcp_iflag_f32 instruction
4518 31 1 bit Reserved, must be 0.
4519 32 **Total size 4 bytes.**
4520 ======= ===================================================================================================================
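When inspecting a code object it can help to split a ``compute_pgm_rsrc2`` word back into its fields. The decoder below is a sketch that uses the bit positions from the table; it covers only a subset of the fields, and the function and dictionary names are hypothetical.

```python
# (lowest bit, width) for a subset of the fields in the table above.
RSRC2_FIELDS = {
    "ENABLE_PRIVATE_SEGMENT": (0, 1),
    "USER_SGPR_COUNT": (1, 5),
    "ENABLE_TRAP_HANDLER": (6, 1),
    "ENABLE_SGPR_WORKGROUP_ID_X": (7, 1),
    "ENABLE_SGPR_WORKGROUP_ID_Y": (8, 1),
    "ENABLE_SGPR_WORKGROUP_ID_Z": (9, 1),
    "ENABLE_SGPR_WORKGROUP_INFO": (10, 1),
    "ENABLE_VGPR_WORKITEM_ID": (11, 2),
    "GRANULATED_LDS_SIZE": (15, 9),
}

# Values of ENABLE_VGPR_WORKITEM_ID, from the work-item ID enumeration table.
WORKITEM_ID_NAMES = {
    0: "SYSTEM_VGPR_WORKITEM_ID_X",
    1: "SYSTEM_VGPR_WORKITEM_ID_X_Y",
    2: "SYSTEM_VGPR_WORKITEM_ID_X_Y_Z",
    3: "SYSTEM_VGPR_WORKITEM_ID_UNDEFINED",
}


def decode_rsrc2(word):
    """Return a dict mapping field names to their values in the word."""
    return {name: (word >> low) & ((1 << width) - 1)
            for name, (low, width) in RSRC2_FIELDS.items()}
```

For example, a kernel that requests six user SGPRs, the work-group ID in X and all three work-item IDs would decode with ``USER_SGPR_COUNT`` equal to 6 and ``ENABLE_VGPR_WORKITEM_ID`` equal to 2.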
4524 .. table:: compute_pgm_rsrc3 for GFX90A, GFX940
4525 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table
4527 ======= ======= =============================== ===========================================================================
4528 Bits Size Field Name Description
4529 ======= ======= =============================== ===========================================================================
4530 5:0 6 bits ACCUM_OFFSET Offset of the first AccVGPR in the unified register file. Granularity 4.
4531 Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
4532 63 - accum-offset = 256.
4533 15:6 10 bits Reserved, must be 0.
4535 16 1 bit TG_SPLIT - If 0 the waves of a work-group are
4536 launched in the same CU.
4537 - If 1 the waves of a work-group can be
4538 launched in different CUs. The waves
4539 cannot use S_BARRIER or LDS.
4540 31:17 15 bits Reserved, must be 0.
4542 32 **Total size 4 bytes.**
4543 ======= ===================================================================================================================
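The ``ACCUM_OFFSET`` encoding described in the table (0 means an offset of 4, 63 means an offset of 256) is a simple linear mapping. A minimal sketch, with hypothetical helper names:

```python
def encode_accum_offset(accum_offset):
    """Map an AccVGPR offset (4, 8, ..., 256) to the 6-bit field value."""
    if accum_offset % 4 or not 4 <= accum_offset <= 256:
        raise ValueError("accum offset must be a multiple of 4 in 4..256")
    return accum_offset // 4 - 1


def decode_accum_offset(field_value):
    """Map the 6-bit field value (0..63) back to the AccVGPR offset."""
    if not 0 <= field_value <= 63:
        raise ValueError("ACCUM_OFFSET is a 6-bit field")
    return (field_value + 1) * 4
```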
4547 .. table:: compute_pgm_rsrc3 for GFX10-GFX11
4548 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table
4550 ======= ======= =============================== ===========================================================================
4551 Bits Size Field Name Description
4552 ======= ======= =============================== ===========================================================================
4553 3:0 4 bits SHARED_VGPR_COUNT Number of shared VGPR blocks when executing in subvector mode. For
4554 wavefront size 64 the value is 0-15, representing 0-120 VGPRs (granularity
4555 of 8), such that (compute_pgm_rsrc1.vgprs +1)*4 + shared_vgpr_count*8 does
4556 not exceed 256. For wavefront size 32 shared_vgpr_count must be 0.
4557 9:4 6 bits INST_PREF_SIZE GFX10
4558 Reserved, must be 0.
4560 Number of instruction bytes to prefetch, starting at the kernel's entry
4561 point instruction, before wavefront starts execution. The value is 0..63
4562 with a granularity of 128 bytes.
4563 10 1 bit TRAP_ON_START GFX10
4564 Reserved, must be 0.
4568 If 1, wavefront starts execution by trapping into the trap handler.
4570 CP is responsible for filling in the trap on start bit in
4571 ``COMPUTE_PGM_RSRC3.TRAP_ON_START`` according to what the runtime
4573 11 1 bit TRAP_ON_END GFX10
4574 Reserved, must be 0.
4578 If 1, wavefront execution terminates by trapping into the trap handler.
4580 CP is responsible for filling in the trap on end bit in
4581 ``COMPUTE_PGM_RSRC3.TRAP_ON_END`` according to what the runtime requests.
4582 30:12 19 bits Reserved, must be 0.
4583 31 1 bit IMAGE_OP GFX10
4584 Reserved, must be 0.
4586 If 1, the kernel execution contains image instructions. If executed as
4587 part of a graphics pipeline, image read instructions will stall waiting
4588 for any necessary ``WAIT_SYNC`` fence to be performed in order to
4589 indicate that earlier pipeline stages have completed writing to the
4592 Not used for compute kernels that are not part of a graphics pipeline and
4594 32 **Total size 4 bytes.**
4595 ======= ===================================================================================================================
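The ``SHARED_VGPR_COUNT`` constraint and the ``INST_PREF_SIZE`` granularity from the table can be expressed as two small checks. This is a sketch with hypothetical function names; ``rsrc1_vgprs`` stands for the granulated VGPR count stored in ``compute_pgm_rsrc1``.

```python
def shared_vgpr_count_is_valid(rsrc1_vgprs, shared_vgpr_count, wavefront_size):
    """Check the SHARED_VGPR_COUNT rule stated in the table."""
    if wavefront_size == 32:
        return shared_vgpr_count == 0
    if not 0 <= shared_vgpr_count <= 15:
        return False
    return (rsrc1_vgprs + 1) * 4 + shared_vgpr_count * 8 <= 256


def inst_pref_bytes(inst_pref_size):
    """Bytes prefetched for a given INST_PREF_SIZE (granularity 128 bytes)."""
    if not 0 <= inst_pref_size <= 63:
        raise ValueError("INST_PREF_SIZE is a 6-bit field")
    return inst_pref_size * 128
```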
4599 .. table:: Floating Point Rounding Mode Enumeration Values
4600 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
4602 ====================================== ===== ==============================
4603 Enumeration Name Value Description
4604 ====================================== ===== ==============================
4605 FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
4606 FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
4607 FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
4608 FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
4609 ====================================== ===== ==============================
4613 .. table:: Floating Point Denorm Mode Enumeration Values
4614 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
4616 ====================================== ===== ==============================
4617 Enumeration Name Value Description
4618 ====================================== ===== ==============================
4619 FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
4621 FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
4622 FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
4623 FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
4624 ====================================== ===== ==============================
4628 .. table:: System VGPR Work-Item ID Enumeration Values
4629 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
4631 ======================================== ===== ============================
4632 Enumeration Name Value Description
4633 ======================================== ===== ============================
4634 SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
4636 SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
4638 SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
4640 SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
4641 ======================================== ===== ============================
4643 .. _amdgpu-amdhsa-initial-kernel-execution-state:
4645 Initial Kernel Execution State
4646 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4648 This section defines the register state that will be set up by the packet
4649 processor prior to the start of execution of every wavefront. This is limited by
4650 the constraints of the hardware controllers of CP/ADC/SPI.
4652 The order of the SGPR registers is defined, but the compiler can specify which
4653 ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
4654 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4655 for enabled registers are dense starting at SGPR0: the first enabled register is
4656 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
4659 The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
4660 all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
4661 using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
4662 actually initialized. These are then immediately followed by the System SGPRs
4663 that are set up by ADC/SPI and can have different values for each wavefront of
4666 SGPR register initial state is defined in
4667 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
4669 .. table:: SGPR Register Set Up Order
4670 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
4672 ========== ========================== ====== ==============================
4673 SGPR Order Name Number Description
4674 (kernel descriptor enable of
4676 ========== ========================== ====== ==============================
4677 First Private Segment Buffer 4 See
4678 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4680 then Dispatch Ptr 2 64-bit address of AQL dispatch
4681 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
4683 then Queue Ptr 2 64-bit address of amd_queue_t
4684 (enable_sgpr_queue_ptr) object for AQL queue on which
4685 the dispatch packet was
4687 then Kernarg Segment Ptr 2 64-bit address of Kernarg
4688 (enable_sgpr_kernarg segment. This is directly
4689 _segment_ptr) copied from the
4690 kernarg_address in the kernel
4693 Having CP load it once avoids
4694 loading it at the beginning of
4696 then Dispatch Id 2 64-bit Dispatch ID of the
4697 (enable_sgpr_dispatch_id) dispatch packet being
4699 then Flat Scratch Init 2 See
4700 (enable_sgpr_flat_scratch :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4702 then Private Segment Size 1 The 32-bit byte size of a
4703 (enable_sgpr_private single work-item's memory
4704 _segment_size) allocation. This is the
4705 value from the kernel
4706 dispatch packet Private
4707 Segment Byte Size rounded up
4708 by CP to a multiple of
4711 Having CP load it once avoids
4712 loading it at the beginning of
4715 This is not used for
4716 GFX7-GFX8 since it is the same
4717 value as the second SGPR of
4718 Flat Scratch Init. However, it
4719 may be needed for GFX9-GFX11 which
4720 changes the meaning of the
4721 Flat Scratch Init value.
4722 then Work-Group Id X 1 32-bit work-group id in X
4723 (enable_sgpr_workgroup_id dimension of grid for
4725 then Work-Group Id Y 1 32-bit work-group id in Y
4726 (enable_sgpr_workgroup_id dimension of grid for
4728 then Work-Group Id Z 1 32-bit work-group id in Z
4729 (enable_sgpr_workgroup_id dimension of grid for
4731 then Work-Group Info 1 {first_wavefront, 14'b0000,
4732 (enable_sgpr_workgroup ordered_append_term[10:0],
4733 _info) threadgroup_size_in_wavefronts[5:0]}
4734 then Scratch Wavefront Offset 1 See
4735 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4736 _segment_wavefront_offset) and
4737 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4738 ========== ========================== ====== ==============================
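The dense numbering rule described above can be sketched in a few lines of Python. This is only an illustration: the short names in ``SGPR_SETUP_ORDER`` are hypothetical labels for the table rows (not kernel descriptor field names), and the sketch deliberately ignores the limit of 16 initialized User SGPRs.

```python
# Order and SGPR counts follow the "SGPR Register Set Up Order" table.
# The names are illustrative labels only (an assumption of this sketch).
SGPR_SETUP_ORDER = [
    ("private_segment_buffer", 4),
    ("dispatch_ptr", 2),
    ("queue_ptr", 2),
    ("kernarg_segment_ptr", 2),
    ("dispatch_id", 2),
    ("flat_scratch_init", 2),
    ("private_segment_size", 1),
    ("workgroup_id_x", 1),
    ("workgroup_id_y", 1),
    ("workgroup_id_z", 1),
    ("workgroup_info", 1),
    ("private_segment_wavefront_offset", 1),
]


def assign_sgprs(enabled):
    """Map each enabled input to its (first, last) SGPR index.

    Enabled inputs are numbered densely starting at SGPR0 in table order;
    disabled inputs receive no SGPR number at all.
    """
    assignment = {}
    next_sgpr = 0
    for name, count in SGPR_SETUP_ORDER:
        if name in enabled:
            assignment[name] = (next_sgpr, next_sgpr + count - 1)
            next_sgpr += count
    return assignment


# Example: only three inputs are enabled, so they occupy SGPR0-SGPR6.
layout = assign_sgprs(
    {"private_segment_buffer", "kernarg_segment_ptr", "workgroup_id_x"}
)
```

In this example the Private Segment Buffer occupies SGPR0-3, the Kernarg Segment Ptr SGPR4-5, and Work-Group Id X SGPR6, because the disabled inputs between them do not consume register numbers.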
4740 The order of the VGPR registers is defined, but the compiler can specify which
4741 ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
4742 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4743 for enabled registers are dense starting at VGPR0: the first enabled register is
4744 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a VGPR number.
4747 There are different methods used for the VGPR initial state:
4749 * Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
4750 specifies otherwise, a separate VGPR register is used per work-item ID. The
4751 VGPR register initial state for this method is defined in
4752 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
4753 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4754 specifies *Packed work-item IDs*, the initial value of VGPR0 register is used
4755 for all work-item IDs. The register layout for this method is defined in
4756 :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.
4758 .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
4759 :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table
4761 ========== ========================== ====== ==============================
4762 VGPR Order Name Number Description
4763 (kernel descriptor enable of
4765 ========== ========================== ====== ==============================
4766 First Work-Item Id X 1 32-bit work-item id in X
4767 (Always initialized) dimension of work-group for
4769 then Work-Item Id Y 1 32-bit work-item id in Y
4770 (enable_vgpr_workitem_id dimension of work-group for
4771 > 0) wavefront lane.
4772 then Work-Item Id Z 1 32-bit work-item id in Z
4773 (enable_vgpr_workitem_id dimension of work-group for
4774 > 1) wavefront lane.
4775 ========== ========================== ====== ==============================
4779 .. table:: Register Layout for Packed Work-Item ID Method
4780 :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table
4782 ======= ======= ================ =========================================
4783 Bits Size Field Name Description
4784 ======= ======= ================ =========================================
4785 0:9 10 bits Work-Item Id X Work-item id in X
4786 dimension of work-group for
4791 10:19 10 bits Work-Item Id Y Work-item id in Y
4792 dimension of work-group for
4795 Initialized if enable_vgpr_workitem_id >
4796 0, otherwise set to 0.
4797 20:29 10 bits Work-Item Id Z Work-item id in Z
4798 dimension of work-group for
4801 Initialized if enable_vgpr_workitem_id >
4802 1, otherwise set to 0.
4803 30:31 2 bits Reserved, set to 0.
4804 ======= ======= ================ =========================================
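As a rough illustration of the packed layout in the table above, the following Python sketch extracts the three 10-bit work-item ids from a 32-bit VGPR0 value and models how the value is formed for a given ``enable_vgpr_workitem_id``. The function names are hypothetical and exist only for this example.

```python
FIELD_MASK = 0x3FF  # each work-item id field is 10 bits wide


def unpack_work_item_ids(vgpr0):
    """Split a packed 32-bit VGPR0 value into (x, y, z) work-item ids."""
    x = vgpr0 & FIELD_MASK          # bits 0:9
    y = (vgpr0 >> 10) & FIELD_MASK  # bits 10:19
    z = (vgpr0 >> 20) & FIELD_MASK  # bits 20:29
    return x, y, z


def pack_work_item_ids(x, y, z, enable_vgpr_workitem_id):
    """Model the initial VGPR0 value for the packed work-item id method.

    Y is only initialized if enable_vgpr_workitem_id > 0 and Z only if
    enable_vgpr_workitem_id > 1; otherwise the field is 0. Bits 30:31
    are reserved and left as 0.
    """
    value = x & FIELD_MASK
    if enable_vgpr_workitem_id > 0:
        value |= (y & FIELD_MASK) << 10
    if enable_vgpr_workitem_id > 1:
        value |= (z & FIELD_MASK) << 20
    return value
```

A kernel that only enables X and Y (``enable_vgpr_workitem_id`` of 1) would therefore always read 0 for the Z field.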
4806 The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
4808 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
4810 2. Work-group Id registers X, Y, Z are set by ADC which supports any
4811 combination including none.
4812 3. Scratch Wavefront Offset is set by SPI on a per-wavefront basis, which is why
4813 its value cannot be included with the flat scratch init value which is per
4814 queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
4815 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
4817 5. Flat Scratch register pair initialization is described in
4818 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4820 The global segment can be accessed either using buffer instructions (GFX6 which
4821 has V# 64-bit address support), flat instructions (GFX7-GFX11), or global
4822 instructions (GFX9-GFX11).
4824 If buffer operations are used, then the compiler can generate a V# with the
4825 following properties:
4829 * ATC: 1 if IOMMU present (such as APU)
4831 * MTYPE set to support memory coherence that matches the runtime (such as CC for
4832 APU and NC for dGPU).
4834 .. _amdgpu-amdhsa-kernel-prolog:
4839 The compiler performs initialization in the kernel prologue depending on the
4840 target and information about things like stack usage in the kernel and called
4841 functions. Some of this initialization requires the compiler to request certain
4842 User and System SGPRs be present in the
4843 :ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
4844 :ref:`amdgpu-amdhsa-kernel-descriptor`.
4846 .. _amdgpu-amdhsa-kernel-prolog-cfi:
4851 1. The CFI return address is undefined.
4853 2. The CFI CFA is defined using an expression which evaluates to a location
4854 description that comprises one memory location description for the
4855 ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
4857 .. _amdgpu-amdhsa-kernel-prolog-m0:
4863 The M0 register must be initialized with a value at least the total LDS size
4864 if the kernel may access LDS via DS or flat operations. Total LDS size is
4865 available in the dispatch packet. For M0, it is also possible to use the maximum
4866 possible value of LDS for the given target (0x7FFF for GFX6 and 0xFFFF for
4869 The M0 register is not used for range checking LDS accesses and so does not
4870 need to be initialized in the prolog.
4872 .. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
4877 If the kernel has function calls it must set up the ABI stack pointer described
4878 in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
4879 SGPR32 to the unswizzled scratch offset of the address past the last local
4882 .. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
4887 If the kernel needs a frame pointer for the reasons defined in
4888 ``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
4889 kernel prolog. If a frame pointer is not required then all uses of the frame
4890 pointer are replaced with immediate ``0`` offsets.
4892 .. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
4897 There are different methods used for initializing flat scratch:
4899 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4900 specifies *Does not support generic address space*:
4902 Flat scratch is not supported and there is no flat scratch register pair.
4904 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4905 specifies *Offset flat scratch*:
4907 If the kernel or any function it calls may use flat operations to access
4908 scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4909 (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
4910 Scratch Wavefront Offset SGPR registers (see
4911 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4913 1. The low word of Flat Scratch Init is the 32-bit byte offset from
4914 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
4915 being managed by SPI for the queue executing the kernel dispatch. This is
4916 the same value used in the Scratch Segment Buffer V# base address.
4918 CP obtains this from the runtime. (The Scratch Segment Buffer base address
4919 is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)
4921 The prolog must add the value of Scratch Wavefront Offset to get the
4922 wavefront's byte scratch backing memory offset from
4923 ``SH_HIDDEN_PRIVATE_BASE_VIMID``.
4925 The Scratch Wavefront Offset must also be used as an offset with Private
4926 segment address when using the Scratch Segment Buffer.
4928 Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
4929 shifted by 8 before moving into FLAT_SCRATCH_HI.
4931 FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
4932 SGPRn is the highest numbered SGPR allocated to the wavefront).
4933 FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
4934 added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
4935 FLAT SCRATCH BASE in flat memory instructions that access the scratch
4937 2. The second word of Flat Scratch Init is the 32-bit byte size of a single
4938 work-item's scratch memory usage.
4940 CP obtains this from the runtime, and it is always a multiple of DWORD. CP
4941 checks that the value in the kernel dispatch packet Private Segment Byte
4942 Size is not larger and requests the runtime to increase the queue's scratch
4945 CP directly loads from the kernel dispatch packet Private Segment Byte Size
4946 field and rounds up to a multiple of DWORD. Having CP load it once avoids
4947 loading it at the beginning of every wavefront.
4949 The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
4950 GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
4951 in flat memory instructions.
4953 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4954 specifies *Absolute flat scratch*:
4956 If the kernel or any function it calls may use flat operations to access
4957 scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4958 (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
4959 uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
4960 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4962 The Flat Scratch Init is the 64-bit address of the base of scratch backing
4963 memory being managed by SPI for the queue executing the kernel dispatch.
4965 CP obtains this from the runtime.
4967 The kernel prolog must add the value of the wave's Scratch Wavefront Offset
4968 and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
4969 which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
4970 memory instructions.
4972 The Scratch Wavefront Offset must also be used as an offset with Private
4973 segment address when using the Scratch Segment Buffer (see
4974 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
4976 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4977 specifies *Architected flat scratch*:
4979 If ENABLE_PRIVATE_SEGMENT is enabled in
4980 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table` then the FLAT_SCRATCH
4981 register pair will be initialized to the 64-bit address of the base of scratch
4982 backing memory being managed by SPI for the queue executing the kernel
4983 dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
4984 flat scratch base in flat memory instructions.
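The arithmetic for the *Offset flat scratch* and *Absolute flat scratch* methods can be sketched as follows. This is a simplified Python model of the values the prolog computes, not the code the backend emits; the function names are hypothetical, and the wrap-around to 32 or 64 bits is an assumption made to mimic fixed-width register arithmetic.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def offset_flat_scratch(init_lo, init_hi, scratch_wavefront_offset):
    """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for the offset method.

    init_lo: low word of Flat Scratch Init (byte offset of the queue's
             scratch backing memory).
    init_hi: second word of Flat Scratch Init (per-work-item byte size).
    """
    # Add the wavefront offset to get the wavefront's byte offset.
    wave_byte_offset = (init_lo + scratch_wavefront_offset) & MASK32
    # The offset register is in units of 256 bytes, so shift right by 8.
    flat_scratch_hi = wave_byte_offset >> 8
    # The size is moved unchanged and used as the FLAT SCRATCH SIZE.
    flat_scratch_lo = init_hi & MASK32
    return flat_scratch_lo, flat_scratch_hi


def absolute_flat_scratch(flat_scratch_init, scratch_wavefront_offset):
    """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for the absolute method.

    flat_scratch_init is the 64-bit base address of the queue's scratch
    backing memory; the result is the wavefront's 64-bit scratch base.
    """
    base = (flat_scratch_init + scratch_wavefront_offset) & MASK64
    return base & MASK32, base >> 32
```

For the *Architected flat scratch* method no such computation is needed in the prolog, since the text above states that the register pair is already initialized with the sum.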
4986 .. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
4988 Private Segment Buffer
4989 ++++++++++++++++++++++
4991 If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
4992 *Architected flat scratch* then a Private Segment Buffer is not supported.
4993 Instead the flat SCRATCH instructions are used.
4995 Otherwise, Private Segment Buffer SGPR register is used to initialize 4 SGPRs
4996 that are used as a V# to access scratch. CP uses the value provided by the
4997 runtime. It is used, together with Scratch Wavefront Offset as an offset, to
4998 access the private memory space using a segment address. See
4999 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
5001 The scratch V# is a four-aligned SGPR and always selected for the kernel as
5004 - If it is known during instruction selection that there is stack usage,
5005 SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
5006 optimizations are disabled (``-O0``), if stack objects already exist (for
5007 locals, etc.), or if there are any function calls.
5009 - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
5010 are reserved for the tentative scratch V#. These will be used if it is
5011 determined that spilling is needed.
5013 - If no use is made of the tentative scratch V#, then it is unreserved,
5014 and the register count is determined ignoring it.
5015 - If use is made of the tentative scratch V#, then its register numbers
5016 are shifted to the first four-aligned SGPR index after the highest one
5017 allocated by the register allocator, and all uses are updated. The
5018 register count includes them in the shifted location.
5019 - In either case, if the processor has the SGPR allocation bug, the
5020 tentative allocation is not shifted or unreserved in order to ensure
5021 the register count is higher to work around the bug.
5025 This approach of using a tentative scratch V# and shifting the register
5026 numbers if used avoids having to perform register allocation a second
5027 time if the tentative V# is eliminated. This is more efficient and
5028 avoids the problem that the second register allocation may perform
5029 spilling which will fail as there is no longer a scratch V#.
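The register shift applied to a tentative scratch V# that turns out to be used can be illustrated with a small Python sketch. The helper names are hypothetical; the sketch only computes the target register numbers and does not model the SGPR allocation bug case, where no shift happens.

```python
def shifted_scratch_vsharp_base(highest_allocated_sgpr):
    """Return the first four-aligned SGPR index after the highest SGPR
    allocated by the register allocator."""
    return (highest_allocated_sgpr // 4 + 1) * 4


def scratch_vsharp_sgprs(highest_allocated_sgpr):
    """Return the four SGPR indices that hold the shifted scratch V#."""
    base = shifted_scratch_vsharp_base(highest_allocated_sgpr)
    return list(range(base, base + 4))
```

For example, if the register allocator used SGPR0 through SGPR9, the scratch V# would move to SGPR12-15 under this model, and the register count would include those four registers.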
5031 When the kernel prolog code is being emitted it is known whether the scratch V#
5032 described above is actually used. If it is, the prolog code must set it up by
5033 copying the Private Segment Buffer to the scratch V# registers and then adding
5034 the Private Segment Wavefront Offset to the queue base address in the V#. The
5035 result is a V# with a base address pointing to the beginning of the wavefront
5036 scratch backing memory.
5038 The Private Segment Buffer is always requested, but the Private Segment
5039 Wavefront Offset is only requested if it is used (see
5040 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
5042 .. _amdgpu-amdhsa-memory-model:
5047 This section describes the mapping of the LLVM memory model onto AMDGPU machine
5048 code (see :ref:`memmodel`).
5050 The AMDGPU backend supports the memory synchronization scopes specified in
5051 :ref:`amdgpu-memory-scopes`.
5053 The code sequences used to implement the memory model specify the order of
5054 instructions that a single thread must execute. The ``s_waitcnt`` and cache
5055 management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
5056 to other memory instructions executed by the same thread. This allows them to be
5057 moved earlier or later which can allow them to be combined with other instances
5058 of the same instruction, or hoisted/sunk out of loops to improve performance.
5059 Only the instructions related to the memory model are given; additional
5060 ``s_waitcnt`` instructions are required to ensure registers are defined before
5061 being used. These may be able to be combined with the memory model ``s_waitcnt``
5062 instructions as described above.
5064 The AMDGPU backend supports the following memory models:
5066 HSA Memory Model [HSA]_
5067 The HSA memory model uses a single happens-before relation for all address
5068 spaces (see :ref:`amdgpu-address-spaces`).
5069 OpenCL Memory Model [OpenCL]_
5070 The OpenCL memory model has separate happens-before relations for the
5071 global and local address spaces. Only a fence specifying both global and
5072 local address spaces, and seq_cst instructions, join the relationships. Since
5073 the LLVM ``fence`` instruction does not allow an address space to be
5074 specified, the OpenCL fence has to conservatively assume both local and
5075 global address spaces were specified. However, optimizations can often be
5076 done to eliminate the additional ``s_waitcnt`` instructions when there are
5077 no intervening memory instructions which access the corresponding address
5078 space. The code sequences in the table indicate what can be omitted for the
5079 OpenCL memory model. The target triple environment is used to determine if the
5080 source language is OpenCL (see :ref:`amdgpu-opencl`).
5082 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
5085 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
5086 termed vector memory operations.
5088 Private address space uses ``buffer_load/store`` using the scratch V#
5089 (GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX11). Since only a single thread
5090 is accessing the memory, atomic memory orderings are not meaningful, and all
5091 accesses are treated as non-atomic.
5093 Constant address space uses ``buffer/global_load`` instructions (or equivalent
5094 scalar memory instructions). Since the constant address space contents do not
5095 change during the execution of a kernel dispatch it is not legal to perform
5096 stores, and atomic memory orderings are not meaningful, and all accesses are
5097 treated as non-atomic.
5099 A memory synchronization scope wider than work-group is not meaningful for the
5100 group (LDS) address space and is treated as work-group.
5102 The memory model does not support the region address space which is treated as
5105 Acquire memory ordering is not meaningful on store atomic instructions and is
5106 treated as non-atomic.
5108 Release memory ordering is not meaningful on load atomic instructions and is
5109 treated as non-atomic.
5111 Acquire-release memory ordering is not meaningful on load or store atomic
5112 instructions and is treated as acquire and release respectively.
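The ordering and scope adjustments described in the preceding paragraphs can be summarized in a short Python sketch. It is only an illustration of the stated rules: the function names and string labels are hypothetical, and the scope list assumes the ordering singlethread, wavefront, workgroup, agent, system.

```python
SCOPES = ["singlethread", "wavefront", "workgroup", "agent", "system"]


def effective_ordering(instruction, ordering):
    """Return the memory ordering actually applied to an instruction.

    instruction is 'load atomic', 'store atomic', 'atomicrmw' or 'fence'.
    """
    # Acquire is not meaningful on a store, nor release on a load.
    if instruction == "store atomic" and ordering == "acquire":
        return "non-atomic"
    if instruction == "load atomic" and ordering == "release":
        return "non-atomic"
    # Acquire-release degrades to the half that is meaningful.
    if ordering == "acq_rel":
        if instruction == "load atomic":
            return "acquire"
        if instruction == "store atomic":
            return "release"
    return ordering


def effective_scope(address_space, scope):
    """Clamp the synchronization scope for the local (LDS) address space."""
    if address_space == "local":
        limit = SCOPES.index("workgroup")
        if SCOPES.index(scope) > limit:
            return "workgroup"
    return scope
```

An ``atomicrmw`` or ``fence`` keeps its ordering unchanged in this model, since the text only describes adjustments for load and store atomics.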
5114 The memory order also adds the single thread optimization constraints defined in
5116 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.
5118 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
5119 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table
5121 ============ ==============================================================
5122 LLVM Memory Optimization Constraints
5124 ============ ==============================================================
5127 acquire - If a load atomic/atomicrmw then no following load/load
5128 atomic/store/store atomic/atomicrmw/fence instruction can be
5129 moved before the acquire.
5130 - If a fence then same as load atomic, plus no preceding
5131 associated fence-paired-atomic can be moved after the fence.
5132 release - If a store atomic/atomicrmw then no preceding load/load
5133 atomic/store/store atomic/atomicrmw/fence instruction can be
5134 moved after the release.
5135 - If a fence then same as store atomic, plus no following
5136 associated fence-paired-atomic can be moved before the
5138 acq_rel Same constraints as both acquire and release.
5139 seq_cst - If a load atomic then same constraints as acquire, plus no
5140 preceding sequentially consistent load atomic/store
5141 atomic/atomicrmw/fence instruction can be moved after the
5143 - If a store atomic then the same constraints as release, plus
5144 no following sequentially consistent load atomic/store
5145 atomic/atomicrmw/fence instruction can be moved before the
5147 - If an atomicrmw/fence then same constraints as acq_rel.
5148 ============ ==============================================================
5150 The code sequences used to implement the memory model are defined in the
5153 * :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
5154 * :ref:`amdgpu-amdhsa-memory-model-gfx90a`
5155 * :ref:`amdgpu-amdhsa-memory-model-gfx940`
5156 * :ref:`amdgpu-amdhsa-memory-model-gfx10-gfx11`
5158 .. _amdgpu-amdhsa-memory-model-gfx6-gfx9:
5160 Memory Model GFX6-GFX9
5161 ++++++++++++++++++++++
5165 * Each agent has multiple shader arrays (SA).
5166 * Each SA has multiple compute units (CU).
5167 * Each CU has multiple SIMDs that execute wavefronts.
5168 * The wavefronts for a single work-group are executed in the same CU but may be
5169 executed by different SIMDs.
5170 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
5172 * All LDS operations of a CU are performed as wavefront wide operations in a
5173 global order and involve no caching. Completion is reported to a wavefront in
5175 * The LDS memory has multiple request queues shared by the SIMDs of a
5176 CU. Therefore, the LDS operations performed by different wavefronts of a
5177 work-group can be reordered relative to each other, which can result in
5178 reordering the visibility of vector memory operations with respect to LDS
5179 operations of other wavefronts in the same work-group. A ``s_waitcnt
5180 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
5181 vector memory operations between wavefronts of a work-group, but not between
5182 operations performed by the same wavefront.
5183 * The vector memory operations are performed as wavefront wide operations and
5184 completion is reported to a wavefront in execution order. The exception is
5185 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
5186 vector memory order if they access LDS memory, and out of LDS operation order
5187 if they access global memory.
5188 * The vector memory operations access a single vector L1 cache shared by all
5189 SIMDs of a CU. Therefore, no special action is required for coherence between the
5190 lanes of a single wavefront, or for coherence between wavefronts in the same
5191 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
5192 wavefronts executing in different work-groups as they may be executing on
5194 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
5195 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
5196 scalar operations are used in a restricted way so do not impact the memory
5197 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
5198 * The vector and scalar memory operations use an L2 cache shared by all CUs on
5200 * The L2 cache has independent channels to service disjoint ranges of virtual
5202 * Each CU has a separate request queue per channel. Therefore, the vector and
5203 scalar memory operations performed by wavefronts executing in different
5204 work-groups (which may be executing on different CUs) of an agent can be
5205 reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
5206 ensure synchronization between vector memory operations of different CUs. It
5207 ensures a previous vector memory operation has completed before executing a
5208 subsequent vector memory or LDS operation and so can be used to meet the
5209 requirements of acquire and release.
5210 * The L2 cache can be kept coherent with other agents on some targets, or ranges
5211 of virtual addresses can be set up to bypass it to ensure system coherence.
5213 Scalar memory operations are only used to access memory that is proven to not
5214 change during the execution of the kernel dispatch. This includes constant
5215 address space and global address space for program scope ``const`` variables.
5216 Therefore, the kernel machine code does not have to maintain the scalar cache to
5217 ensure it is coherent with the vector caches. The scalar and vector caches are
5218 invalidated between kernel dispatches by CP since constant address space data
5219 may change between kernel dispatch executions. See
5220 :ref:`amdgpu-amdhsa-memory-spaces`.
5222 The one exception is if scalar writes are used to spill SGPR registers. In this
5223 case the AMDGPU backend ensures the memory location used to spill is never
5224 accessed by vector memory operations at the same time. If scalar writes are used
5225 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
5226 return since the locations may be used for vector memory instructions by a
5227 future wavefront that uses the same scratch area, or a function call that
5228 creates a frame at the same address, respectively. There is no need for a
5229 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
5231 For kernarg backing memory:
5233 * CP invalidates the L1 cache at the start of each kernel dispatch.
5234 * On dGPU the kernarg backing memory is allocated in host memory accessed as
5235 MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
5236 causes it to be treated as non-volatile and so is not invalidated by
5238 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
5239 and so the L2 cache will be coherent with the CPU and other agents.
5241 Scratch backing memory (which is used for the private address space) is accessed
5242 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
5243 only accessed by a single thread, and is always write-before-read, there is
5244 never a need to invalidate these entries from the L1 cache. Hence all cache
5245 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
5247 The code sequences used to implement the memory model for GFX6-GFX9 are defined
5248 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
5250 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
5251 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
5253 ============ ============ ============== ========== ================================
5254 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
5255 Ordering Sync Scope Address GFX6-GFX9
5257 ============ ============ ============== ========== ================================
5259 ------------------------------------------------------------------------------------
5260 load *none* *none* - global - !volatile & !nontemporal
5262 - private 1. buffer/global/flat_load
5264 - !volatile & nontemporal
5266 1. buffer/global/flat_load
5271 1. buffer/global/flat_load
5273 2. s_waitcnt vmcnt(0)
5275 - Must happen before
5276 any following volatile
5287 load *none* *none* - local 1. ds_load
5288 store *none* *none* - global - !volatile & !nontemporal
5290 - private 1. buffer/global/flat_store
5292 - !volatile & nontemporal
5294 1. buffer/global/flat_store
5299 1. buffer/global/flat_store
5300 2. s_waitcnt vmcnt(0)
5302 - Must happen before
5303 any following volatile
5314 store *none* *none* - local 1. ds_store
5315 **Unordered Atomic**
5316 ------------------------------------------------------------------------------------
5317 load atomic unordered *any* *any* *Same as non-atomic*.
5318 store atomic unordered *any* *any* *Same as non-atomic*.
5319 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
5320 **Monotonic Atomic**
5321 ------------------------------------------------------------------------------------
5322 load atomic monotonic - singlethread - global 1. buffer/global/ds/flat_load
5324 - workgroup - generic
5325 load atomic monotonic - agent - global 1. buffer/global/flat_load
5326 - system - generic glc=1
5327 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
5328 - wavefront - generic
5332 store atomic monotonic - singlethread - local 1. ds_store
5335 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
5336 - wavefront - generic
5340 atomicrmw monotonic - singlethread - local 1. ds_atomic
5344 ------------------------------------------------------------------------------------
5345 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
5348 load atomic acquire - workgroup - global 1. buffer/global_load
5349 load atomic acquire - workgroup - local 1. ds/flat_load
5350 - generic 2. s_waitcnt lgkmcnt(0)
5353 - Must happen before
5362 older than a local load
5366 load atomic acquire - agent - global 1. buffer/global_load
5368 2. s_waitcnt vmcnt(0)
5370 - Must happen before
5378 3. buffer_wbinvl1_vol
5380 - Must happen before
5390 load atomic acquire - agent - generic 1. flat_load glc=1
5391 - system 2. s_waitcnt vmcnt(0) &
5396 - Must happen before
5399 - Ensures the flat_load
5404 3. buffer_wbinvl1_vol
5406 - Must happen before
5416 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
5419 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
5420 atomicrmw acquire - workgroup - local 1. ds/flat_atomic
5421 - generic 2. s_waitcnt lgkmcnt(0)
5424 - Must happen before
5437 atomicrmw acquire - agent - global 1. buffer/global_atomic
5438 - system 2. s_waitcnt vmcnt(0)
5440 - Must happen before
5449 3. buffer_wbinvl1_vol
5451 - Must happen before
5461 atomicrmw acquire - agent - generic 1. flat_atomic
5462 - system 2. s_waitcnt vmcnt(0) &
5467 - Must happen before
5476 3. buffer_wbinvl1_vol
5478 - Must happen before
5488 fence acquire - singlethread *none* *none*
5490 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
5495 - However, since LLVM
5520 fence-paired-atomic).
5521 - Must happen before
5532 fence-paired-atomic.
5534 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
5541 - However, since LLVM
5549 - Could be split into
5558 - s_waitcnt vmcnt(0)
5569 fence-paired-atomic).
5570 - s_waitcnt lgkmcnt(0)
5581 fence-paired-atomic).
5582 - Must happen before
5596 fence-paired-atomic.
5598 2. buffer_wbinvl1_vol
5600 - Must happen before any
5601 following global/generic
5611 ------------------------------------------------------------------------------------
5612 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
5615 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
5624 - Must happen before
5635 2. buffer/global/flat_store
5636 store atomic release - workgroup - local 1. ds_store
5637 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
5638 - system - generic vmcnt(0)
5644 - Could be split into
5653 - s_waitcnt vmcnt(0)
5660 - s_waitcnt lgkmcnt(0)
5667 - Must happen before
5678 2. buffer/global/flat_store
5679 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
5682 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
5691 - Must happen before
5702 2. buffer/global/flat_atomic
5703 atomicrmw release - workgroup - local 1. ds_atomic
5704 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
5705 - system - generic vmcnt(0)
5709 - Could be split into
5718 - s_waitcnt vmcnt(0)
5725 - s_waitcnt lgkmcnt(0)
5732 - Must happen before
5743 2. buffer/global/flat_atomic
5744 fence release - singlethread *none* *none*
5746 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
5751 - However, since LLVM
5772 - Must happen before
5781 fence-paired-atomic).
5788 fence-paired-atomic.
5790 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
5801 - However, since LLVM
5816 - Could be split into
5825 - s_waitcnt vmcnt(0)
5832 - s_waitcnt lgkmcnt(0)
5839 - Must happen before
5848 fence-paired-atomic).
5855 fence-paired-atomic.
5857 **Acquire-Release Atomic**
5858 ------------------------------------------------------------------------------------
5859 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
5862 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
5871 - Must happen before
5882 2. buffer/global_atomic
5884 atomicrmw acq_rel - workgroup - local 1. ds_atomic
5885 2. s_waitcnt lgkmcnt(0)
5888 - Must happen before
5897 older than the local load
5901 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
5910 - Must happen before
5922 3. s_waitcnt lgkmcnt(0)
5925 - Must happen before
5934 older than a local load
5938 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
5943 - Could be split into
5952 - s_waitcnt vmcnt(0)
5959 - s_waitcnt lgkmcnt(0)
5966 - Must happen before
5977 2. buffer/global_atomic
5978 3. s_waitcnt vmcnt(0)
5980 - Must happen before
5989 4. buffer_wbinvl1_vol
5991 - Must happen before
6001 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
6006 - Could be split into
6015 - s_waitcnt vmcnt(0)
6022 - s_waitcnt lgkmcnt(0)
6029 - Must happen before
6041 3. s_waitcnt vmcnt(0) &
6046 - Must happen before
6055 4. buffer_wbinvl1_vol
6057 - Must happen before
6067 fence acq_rel - singlethread *none* *none*
6069 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
6089 - Must happen before
6112 acquire-fence-paired-atomic)
6133 release-fence-paired-atomic).
6138 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
6145 - However, since LLVM
6153 - Could be split into
6162 - s_waitcnt vmcnt(0)
6169 - s_waitcnt lgkmcnt(0)
6176 - Must happen before
6181 global/local/generic
6190 acquire-fence-paired-atomic)
6202 global/local/generic
6211 release-fence-paired-atomic).
6216 2. buffer_wbinvl1_vol
6218 - Must happen before
6232 **Sequentially Consistent Atomic**
6233 ------------------------------------------------------------------------------------
6234 load atomic seq_cst - singlethread - global *Same as corresponding
6235 - wavefront - local load atomic acquire,
6236 - generic except must generate
6237 all instructions even
6239 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
6255 lgkmcnt(0) and so do
6287 order. The s_waitcnt
6288 could be placed after
6292 make the s_waitcnt be
6299 instructions same as
6302 except must generate
6303 all instructions even
6305 load atomic seq_cst - workgroup - local *Same as corresponding
6306 load atomic acquire,
6307 except must generate
6308 all instructions even
6311 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
6312 - system - generic vmcnt(0)
6314 - Could be split into
6323 - s_waitcnt lgkmcnt(0)
6336 lgkmcnt(0) and so do
6339 - s_waitcnt vmcnt(0)
6384 order. The s_waitcnt
6385 could be placed after
6389 make the s_waitcnt be
6396 instructions same as
6399 except must generate
6400 all instructions even
6402 store atomic seq_cst - singlethread - global *Same as corresponding
6403 - wavefront - local store atomic release,
6404 - workgroup - generic except must generate
6405 - agent all instructions even
6406 - system for OpenCL.*
6407 atomicrmw seq_cst - singlethread - global *Same as corresponding
6408 - wavefront - local atomicrmw acq_rel,
6409 - workgroup - generic except must generate
6410 - agent all instructions even
6411 - system for OpenCL.*
6412 fence seq_cst - singlethread *none* *Same as corresponding
6413 - wavefront fence acq_rel,
6414 - workgroup except must generate
6415 - agent all instructions even
6416 - system for OpenCL.*
6417 ============ ============ ============== ========== ================================
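
The sequentially consistent rows at the end of the table above mostly reuse the code sequence of a weaker ordering and additionally require that all instructions are generated even for OpenCL. The following Python sketch records only that reuse relation. The names ``SEQ_CST_REUSES`` and ``seq_cst_base_sequence`` are hypothetical and are not part of LLVM, and the sketch does not model the extra leading ``s_waitcnt`` that the table lists for ``load atomic seq_cst`` at workgroup scope and wider.

```python
# Hypothetical lookup (not an LLVM API): which ordering's code sequence a
# seq_cst operation reuses, according to the table rows above.
SEQ_CST_REUSES = {
    "load atomic": "acquire",
    "store atomic": "release",
    "atomicrmw": "acq_rel",
    "fence": "acq_rel",
}


def seq_cst_base_sequence(instr):
    """Return (instr, ordering) naming the table row whose sequence is reused."""
    return instr, SEQ_CST_REUSES[instr]
```

For example, ``seq_cst_base_sequence("atomicrmw")`` names the ``atomicrmw acq_rel`` row as the one to consult.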
6419 .. _amdgpu-amdhsa-memory-model-gfx90a:
6426 * Each agent has multiple shader arrays (SA).
6427 * Each SA has multiple compute units (CU).
6428 * Each CU has multiple SIMDs that execute wavefronts.
6429 * The wavefronts for a single work-group are executed in the same CU but may be
6430 executed by different SIMDs. The exception is tgsplit execution mode, in which
6431 the wavefronts may be executed by different SIMDs in different CUs.
6432 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
6433 executing on it. The exception is tgsplit execution mode, in which no LDS is
6434 allocated because wavefronts of the same work-group can be in different CUs.
6435 * All LDS operations of a CU are performed as wavefront wide operations in a
6436 global order and involve no caching. Completion is reported to a wavefront in
6438 * The LDS memory has multiple request queues shared by the SIMDs of a
6439 CU. Therefore, the LDS operations performed by different wavefronts of a
6440 work-group can be reordered relative to each other, which can result in
6441 reordering the visibility of vector memory operations with respect to LDS
6442 operations of other wavefronts in the same work-group. A ``s_waitcnt
6443 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
6444 vector memory operations between wavefronts of a work-group, but not between
6445 operations performed by the same wavefront.
6446 * The vector memory operations are performed as wavefront wide operations and
6447 completion is reported to a wavefront in execution order. The exception is
6448 that ``flat_load/store/atomic`` instructions can report out of vector memory
6449 order if they access LDS memory, and out of LDS operation order if they access
6451 * The vector memory operations access a single vector L1 cache shared by all
6452 SIMDs of a CU. Therefore:
6454 * No special action is required for coherence between the lanes of a single
6457 * No special action is required for coherence between wavefronts in the same
6458 work-group since they execute on the same CU. The exception is when in
6459 tgsplit execution mode as wavefronts of the same work-group can be in
6460 different CUs and so a ``buffer_wbinvl1_vol`` is required as described in
6463 * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
6464 executing in different work-groups as they may be executing on different
6467 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
6468 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
6469 scalar operations are used in a restricted way so do not impact the memory
6470 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
6471 * The vector and scalar memory operations use an L2 cache shared by all CUs on
6474 * The L2 cache has independent channels to service disjoint ranges of virtual
6476 * Each CU has a separate request queue per channel. Therefore, the vector and
6477 scalar memory operations performed by wavefronts executing in different
6478 work-groups (which may be executing on different CUs), or the same
6479 work-group if executing in tgsplit mode, of an agent can be reordered
6480 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
6481 synchronization between vector memory operations of different CUs. It
6482 ensures a previous vector memory operation has completed before executing a
6483 subsequent vector memory or LDS operation and so can be used to meet the
6484 requirements of acquire and release.
6485 * The L2 cache of one agent can be kept coherent with other agents by:
6486 using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
6487 C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
6488 the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
6490 * Any local memory cache lines will be automatically invalidated by writes
6491 from CUs associated with other L2 caches, or writes from the CPU, due to
6492 the cache probe caused by coherent requests. Coherent requests are caused
6493 by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
6494 XGMI, and by PCIe requests that are configured to be coherent requests.
6495 * XGMI accesses from the CPU to local memory may be cached on the CPU.
6496 Subsequent access from the GPU will automatically invalidate or write back
6497 the CPU cache due to the L2 probe filter and the PTE C-bit being set.
6498 * Since all work-groups on the same agent share the same L2, no L2
6499 invalidation or writeback is required for coherence.
6500 * To ensure coherence of local and remote memory writes of work-groups in
6501 different agents a ``buffer_wbl2`` is required. It will write back dirty L2
6502 cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
6503 (used for remote coarse grain memory). Note that MTYPE CC (used for local
6504 fine grain memory) causes write through to DRAM, and MTYPE UC (used for
6505 remote fine grain memory) bypasses the L2, so neither will ever result in
6506 dirty L2 cache lines.
6507 * To ensure coherence of local and remote memory reads of work-groups in
6508 different agents a ``buffer_invl2`` is required. It will invalidate L2
6509 cache lines with MTYPE NC (used for remote coarse grain memory). Note that
6510 MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
6511 coarse grain memory) cause local reads to be invalidated by remote writes
6512 with the PTE C-bit set, so these cache lines are not invalidated. Note that
6513 MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
6514 never result in L2 cache lines that need to be invalidated.
6516 * PCIe access from the GPU to the CPU memory is kept coherent by using the
6517 MTYPE UC (uncached) which bypasses the L2.
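
The L2 rules in the list above can be summarized as a small lookup from MTYPE to the L2 maintenance instruction needed for coherence between agents. The following Python sketch is illustrative only: ``MTYPE_RULES`` and ``l2_maintenance`` are hypothetical names, not LLVM or driver APIs, and the model covers only the cases described above.

```python
# Hypothetical summary of the GFX90A L2 rules above (not an LLVM API).
# MTYPE -> (typical use, may leave dirty L2 lines, may hold stale L2 lines)
MTYPE_RULES = {
    "RW": ("local coarse grain memory", True, False),
    "NC": ("remote coarse grain memory", True, True),
    "CC": ("local fine grain memory", False, False),   # writes through to DRAM
    "UC": ("remote fine grain memory", False, False),  # bypasses the L2
}


def l2_maintenance(mtype, releasing):
    """Return the L2 instruction needed for cross-agent coherence, or None.

    releasing=True models making this agent's writes visible to other agents;
    releasing=False models making other agents' writes visible to this agent.
    """
    _use, may_be_dirty, may_be_stale = MTYPE_RULES[mtype]
    if releasing:
        return "buffer_wbl2" if may_be_dirty else None
    return "buffer_invl2" if may_be_stale else None
```

Under these assumptions only MTYPE NC needs both a writeback and an invalidate, which matches the description of remote coarse grain memory above.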
6519 Scalar memory operations are only used to access memory that is proven to not
6520 change during the execution of the kernel dispatch. This includes constant
6521 address space and global address space for program scope ``const`` variables.
6522 Therefore, the kernel machine code does not have to maintain the scalar cache to
6523 ensure it is coherent with the vector caches. The scalar and vector caches are
6524 invalidated between kernel dispatches by CP since constant address space data
6525 may change between kernel dispatch executions. See
6526 :ref:`amdgpu-amdhsa-memory-spaces`.
6528 The one exception is if scalar writes are used to spill SGPR registers. In this
6529 case the AMDGPU backend ensures the memory location used to spill is never
6530 accessed by vector memory operations at the same time. If scalar writes are used
6531 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
6532 return since the locations may be used for vector memory instructions by a
6533 future wavefront that uses the same scratch area, or a function call that
6534 creates a frame at the same address, respectively. There is no need for a
6535 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
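
As a rough model of the rule above, the following Python sketch places an ``s_dcache_wb`` before every program end and function return when scalar writes were used for SGPR spilling. It is not how the backend is implemented: the function name, the ``uses_scalar_writes`` flag, and the placeholder mnemonics ``<spill>`` and ``<return>`` are assumptions made for illustration.

```python
def insert_scalar_writebacks(instrs, uses_scalar_writes,
                             exits=("s_endpgm", "<return>")):
    """Return a copy of instrs with s_dcache_wb placed before each exit.

    instrs is a list of mnemonic strings. "<return>" is a hypothetical
    placeholder for whatever instruction implements a function return.
    """
    if not uses_scalar_writes:
        return list(instrs)
    out = []
    for op in instrs:
        if op in exits:
            # Write back the scalar cache so later vector accesses to the
            # same scratch locations observe the spilled values.
            out.append("s_dcache_wb")
        out.append(op)
    return out
```

No ``s_dcache_inv`` is modeled, since the text above states that all scalar writes are write-before-read in the same thread.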
6537 For kernarg backing memory:
6539 * CP invalidates the L1 cache at the start of each kernel dispatch.
6540 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
6541 memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
6542 cache. This also causes it to be treated as non-volatile and so is not
6543 invalidated by ``*_vol``.
6544 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
6545 so the L2 cache will be coherent with the CPU and other agents.
6547 Scratch backing memory (which is used for the private address space) is accessed
6548 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
6549 only accessed by a single thread, and is always write-before-read, there is
6550 never a need to invalidate these entries from the L1 cache. Hence all cache
6551 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
6553 The code sequences used to implement the memory model for GFX90A are defined
6554 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
6556 .. table:: AMDHSA Memory Model Code Sequences GFX90A
6557 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table
6559 ============ ============ ============== ========== ================================
6560 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
6561 Ordering Sync Scope Address GFX90A
6563 ============ ============ ============== ========== ================================
6565 ------------------------------------------------------------------------------------
6566 load *none* *none* - global - !volatile & !nontemporal
6568 - private 1. buffer/global/flat_load
6570 - !volatile & nontemporal
6572 1. buffer/global/flat_load
6577 1. buffer/global/flat_load
6579 2. s_waitcnt vmcnt(0)
6581 - Must happen before
6582 any following volatile
6593 load *none* *none* - local 1. ds_load
6594 store *none* *none* - global - !volatile & !nontemporal
6596 - private 1. buffer/global/flat_store
6598 - !volatile & nontemporal
6600 1. buffer/global/flat_store
6605 1. buffer/global/flat_store
6606 2. s_waitcnt vmcnt(0)
6608 - Must happen before
6609 any following volatile
6620 store *none* *none* - local 1. ds_store
6621 **Unordered Atomic**
6622 ------------------------------------------------------------------------------------
6623 load atomic unordered *any* *any* *Same as non-atomic*.
6624 store atomic unordered *any* *any* *Same as non-atomic*.
6625 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
6626 **Monotonic Atomic**
6627 ------------------------------------------------------------------------------------
6628 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
6629 - wavefront - generic
6630 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
6633 - If not TgSplit execution
6636 load atomic monotonic - singlethread - local *If TgSplit execution mode,
6637 - wavefront local address space cannot
6638 - workgroup be used.*
6641 load atomic monotonic - agent - global 1. buffer/global/flat_load
6643 load atomic monotonic - system - global 1. buffer/global/flat_load
6645 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
6646 - wavefront - generic
6649 store atomic monotonic - system - global 1. buffer/global/flat_store
6651 store atomic monotonic - singlethread - local *If TgSplit execution mode,
6652 - wavefront local address space cannot
6653 - workgroup be used.*
6656 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
6657 - wavefront - generic
6660 atomicrmw monotonic - system - global 1. buffer/global/flat_atomic
6662 atomicrmw monotonic - singlethread - local *If TgSplit execution mode,
6663 - wavefront local address space cannot
6664 - workgroup be used.*
6668 ------------------------------------------------------------------------------------
6669 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
6672 load atomic acquire - workgroup - global 1. buffer/global_load glc=1
6674 - If not TgSplit execution
6677 2. s_waitcnt vmcnt(0)
6679 - If not TgSplit execution
6681 - Must happen before the
6682 following buffer_wbinvl1_vol.
6684 3. buffer_wbinvl1_vol
6686 - If not TgSplit execution
6688 - Must happen before
6699 load atomic acquire - workgroup - local *If TgSplit execution mode,
6700 local address space cannot
6704 2. s_waitcnt lgkmcnt(0)
6707 - Must happen before
6716 older than the local load
6720 load atomic acquire - workgroup - generic 1. flat_load glc=1
6722 - If not TgSplit execution
6725 2. s_waitcnt lgkm/vmcnt(0)
6727 - Use lgkmcnt(0) if not
6728 TgSplit execution mode
6729 and vmcnt(0) if TgSplit
6731 - If OpenCL, omit lgkmcnt(0).
6732 - Must happen before
6734 buffer_wbinvl1_vol and any
6735 following global/generic
6742 older than a local load
6746 3. buffer_wbinvl1_vol
6748 - If not TgSplit execution
6755 load atomic acquire - agent - global 1. buffer/global_load
6757 2. s_waitcnt vmcnt(0)
6759 - Must happen before
6767 3. buffer_wbinvl1_vol
6769 - Must happen before
6779 load atomic acquire - system - global 1. buffer/global/flat_load
6781 2. s_waitcnt vmcnt(0)
6783 - Must happen before
6784 following buffer_invl2 and
6794 - Must happen before
6802 stale L1 global data,
6803 nor see stale L2 MTYPE
6805 MTYPE RW and CC memory will
6806 never be stale in L2 due to
6809 load atomic acquire - agent - generic 1. flat_load glc=1
6810 2. s_waitcnt vmcnt(0) &
6813 - If TgSplit execution mode,
6817 - Must happen before
6820 - Ensures the flat_load
6825 3. buffer_wbinvl1_vol
6827 - Must happen before
6837 load atomic acquire - system - generic 1. flat_load glc=1
6838 2. s_waitcnt vmcnt(0) &
6841 - If TgSplit execution mode,
6845 - Must happen before
6849 - Ensures the flat_load
6857 - Must happen before
6865 stale L1 global data,
6866 nor see stale L2 MTYPE
6868 MTYPE RW and CC memory will
6869 never be stale in L2 due to
6872 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic
6873 - wavefront - generic
6874 atomicrmw acquire - singlethread - local *If TgSplit execution mode,
6875 - wavefront local address space cannot
6879 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
6880 2. s_waitcnt vmcnt(0)
6882 - If not TgSplit execution
6884 - Must happen before the
6885 following buffer_wbinvl1_vol.
6886 - Ensures the atomicrmw
6891 3. buffer_wbinvl1_vol
6893 - If not TgSplit execution
6895 - Must happen before
6905 atomicrmw acquire - workgroup - local *If TgSplit execution mode,
6906 local address space cannot
6910 2. s_waitcnt lgkmcnt(0)
6913 - Must happen before
6922 older than the local
6926 atomicrmw acquire - workgroup - generic 1. flat_atomic
6927 2. s_waitcnt lgkm/vmcnt(0)
6929 - Use lgkmcnt(0) if not
6930 TgSplit execution mode
6931 and vmcnt(0) if TgSplit
6933 - If OpenCL, omit lgkmcnt(0).
6934 - Must happen before
6936 buffer_wbinvl1_vol and
6949 3. buffer_wbinvl1_vol
6951 - If not TgSplit execution
6958 atomicrmw acquire - agent - global 1. buffer/global_atomic
6959 2. s_waitcnt vmcnt(0)
6961 - Must happen before
6970 3. buffer_wbinvl1_vol
6972 - Must happen before
6982 atomicrmw acquire - system - global 1. buffer/global_atomic
6983 2. s_waitcnt vmcnt(0)
6985 - Must happen before
6986 following buffer_invl2 and
6997 - Must happen before
7005 stale L1 global data,
7006 nor see stale L2 MTYPE
7008 MTYPE RW and CC memory will
7009 never be stale in L2 due to
7012 atomicrmw acquire - agent - generic 1. flat_atomic
7013 2. s_waitcnt vmcnt(0) &
7016 - If TgSplit execution mode,
7020 - Must happen before
7029 3. buffer_wbinvl1_vol
7031 - Must happen before
7041 atomicrmw acquire - system - generic 1. flat_atomic
7042 2. s_waitcnt vmcnt(0) &
7045 - If TgSplit execution mode,
7049 - Must happen before
7062 - Must happen before
7070 stale L1 global data,
7071 nor see stale L2 MTYPE
7073 MTYPE RW and CC memory will
7074 never be stale in L2 due to
7077 fence acquire - singlethread *none* *none*
7079 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
7081 - Use lgkmcnt(0) if not
7082 TgSplit execution mode
7083 and vmcnt(0) if TgSplit
7093 - However, since LLVM
7108 - s_waitcnt vmcnt(0)
7120 fence-paired-atomic).
7121 - s_waitcnt lgkmcnt(0)
7132 fence-paired-atomic).
7133 - Must happen before
7135 buffer_wbinvl1_vol and
7146 fence-paired-atomic.
7148 2. buffer_wbinvl1_vol
7150 - If not TgSplit execution
7157 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
7160 - If TgSplit execution mode,
7166 - However, since LLVM
7174 - Could be split into
7183 - s_waitcnt vmcnt(0)
7194 fence-paired-atomic).
7195 - s_waitcnt lgkmcnt(0)
7206 fence-paired-atomic).
7207 - Must happen before
7221 fence-paired-atomic.
7223 2. buffer_wbinvl1_vol
7225 - Must happen before any
7226 following global/generic
7235 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) &
7238 - If TgSplit execution mode,
7244 - However, since LLVM
7252 - Could be split into
7261 - s_waitcnt vmcnt(0)
7272 fence-paired-atomic).
7273 - s_waitcnt lgkmcnt(0)
7284 fence-paired-atomic).
7285 - Must happen before
7286 the following buffer_invl2 and
7299 fence-paired-atomic.
7304 - Must happen before any
7305 following global/generic
7312 stale L1 global data,
7313 nor see stale L2 MTYPE
7315 MTYPE RW and CC memory will
7316 never be stale in L2 due to
7319 ------------------------------------------------------------------------------------
7320 store atomic release - singlethread - global 1. buffer/global/flat_store
7321 - wavefront - generic
7322 store atomic release - singlethread - local *If TgSplit execution mode,
7323 - wavefront local address space cannot
7327 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
7329 - Use lgkmcnt(0) if not
7330 TgSplit execution mode
7331 and vmcnt(0) if TgSplit
7333 - If OpenCL, omit lgkmcnt(0).
7334 - s_waitcnt vmcnt(0)
7337 global/generic load/store/
7338 load atomic/store atomic/
7340 - s_waitcnt lgkmcnt(0)
7347 - Must happen before
7358 2. buffer/global/flat_store
7359 store atomic release - workgroup - local *If TgSplit execution mode,
7360 local address space cannot
7364 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
7367 - If TgSplit execution mode,
7373 - Could be split into
7382 - s_waitcnt vmcnt(0)
7389 - s_waitcnt lgkmcnt(0)
7396 - Must happen before
7407 2. buffer/global/flat_store
7408 store atomic release - system - global 1. buffer_wbl2
7410 - Must happen before
7411 following s_waitcnt.
7412 - Performs L2 writeback to
7416 visible at system scope.
7418 2. s_waitcnt lgkmcnt(0) &
7421 - If TgSplit execution mode,
7427 - Could be split into
7436 - s_waitcnt vmcnt(0)
7437 must happen after any
7443 - s_waitcnt lgkmcnt(0)
7444 must happen after any
7450 - Must happen before
7455 to memory and the L2
7462 3. buffer/global/flat_store
7463 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic
7464 - wavefront - generic
7465 atomicrmw release - singlethread - local *If TgSplit execution mode,
7466 - wavefront local address space cannot
7470 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
7472 - Use lgkmcnt(0) if not
7473 TgSplit execution mode
7474 and vmcnt(0) if TgSplit
7478 - s_waitcnt vmcnt(0)
7481 global/generic load/store/
7482 load atomic/store atomic/
7484 - s_waitcnt lgkmcnt(0)
7491 - Must happen before
7502 2. buffer/global/flat_atomic
7503 atomicrmw release - workgroup - local *If TgSplit execution mode,
7504 local address space cannot
7508 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
7511 - If TgSplit execution mode,
7515 - Could be split into
7524 - s_waitcnt vmcnt(0)
7531 - s_waitcnt lgkmcnt(0)
7538 - Must happen before
7549 2. buffer/global/flat_atomic
7550 atomicrmw release - system - global 1. buffer_wbl2
7552 - Must happen before
7553 following s_waitcnt.
7554 - Performs L2 writeback to
7558 visible at system scope.
7560 2. s_waitcnt lgkmcnt(0) &
7563 - If TgSplit execution mode,
7567 - Could be split into
7576 - s_waitcnt vmcnt(0)
7583 - s_waitcnt lgkmcnt(0)
7590 - Must happen before
7595 to memory and the L2
7602 3. buffer/global/flat_atomic
7603 fence release - singlethread *none* *none*
7605 fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
7607 - Use lgkmcnt(0) if not
7608 TgSplit execution mode
7609 and vmcnt(0) if TgSplit
7619 - However, since LLVM
7634 - s_waitcnt vmcnt(0)
7639 load atomic/store atomic/
7641 - s_waitcnt lgkmcnt(0)
7648 - Must happen before
7657 fence-paired-atomic).
7664 fence-paired-atomic.
7666 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
7669 - If TgSplit execution mode,
7679 - However, since LLVM
7694 - Could be split into
7703 - s_waitcnt vmcnt(0)
7710 - s_waitcnt lgkmcnt(0)
7717 - Must happen before
7726 fence-paired-atomic).
7733 fence-paired-atomic.
7735 fence release - system *none* 1. buffer_wbl2
7740 - Must happen before
7741 following s_waitcnt.
7742 - Performs L2 writeback to
7746 visible at system scope.
7748 2. s_waitcnt lgkmcnt(0) &
7751 - If TgSplit execution mode,
7761 - However, since LLVM
7776 - Could be split into
7785 - s_waitcnt vmcnt(0)
7792 - s_waitcnt lgkmcnt(0)
7799 - Must happen before
7808 fence-paired-atomic).
7815 fence-paired-atomic.
7817 **Acquire-Release Atomic**
7818 ------------------------------------------------------------------------------------
7819 atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic
7820 - wavefront - generic
7821 atomicrmw acq_rel - singlethread - local *If TgSplit execution mode,
7822 - wavefront local address space cannot
7826 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
7828 - Use lgkmcnt(0) if not
7829 TgSplit execution mode
7830 and vmcnt(0) if TgSplit
7840 - s_waitcnt vmcnt(0)
7843 global/generic load/store/
7844 load atomic/store atomic/
7846 - s_waitcnt lgkmcnt(0)
7853 - Must happen before
7864 2. buffer/global_atomic
7865 3. s_waitcnt vmcnt(0)
7867 - If not TgSplit execution
7869 - Must happen before
7879 4. buffer_wbinvl1_vol
7881 - If not TgSplit execution
7888 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode,
7889 local address space cannot
7893 2. s_waitcnt lgkmcnt(0)
7896 - Must happen before
7905 older than the local load
7909 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkm/vmcnt(0)
7911 - Use lgkmcnt(0) if not
7912 TgSplit execution mode
7913 and vmcnt(0) if TgSplit
7917 - s_waitcnt vmcnt(0)
7920 global/generic load/store/
7921 load atomic/store atomic/
7923 - s_waitcnt lgkmcnt(0)
7930 - Must happen before
7942 3. s_waitcnt lgkmcnt(0) &
7945 - If not TgSplit execution
7946 mode, omit vmcnt(0).
7949 - Must happen before
7951 buffer_wbinvl1_vol and
7960 older than a local load
7964 3. buffer_wbinvl1_vol
7966 - If not TgSplit execution
7973 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
7976 - If TgSplit execution mode,
7980 - Could be split into
7989 - s_waitcnt vmcnt(0)
7996 - s_waitcnt lgkmcnt(0)
8003 - Must happen before
8014 2. buffer/global_atomic
8015 3. s_waitcnt vmcnt(0)
8017 - Must happen before
8026 4. buffer_wbinvl1_vol
8028 - Must happen before
8038 atomicrmw acq_rel - system - global 1. buffer_wbl2
8040 - Must happen before
8041 following s_waitcnt.
8042 - Performs L2 writeback to
8046 visible at system scope.
8048 2. s_waitcnt lgkmcnt(0) &
8051 - If TgSplit execution mode,
8055 - Could be split into
8064 - s_waitcnt vmcnt(0)
8071 - s_waitcnt lgkmcnt(0)
8078 - Must happen before
8083 to global and L2 writeback
8084 have completed before
8089 3. buffer/global_atomic
8090 4. s_waitcnt vmcnt(0)
8092 - Must happen before
8093 following buffer_invl2 and
8104 - Must happen before
8112 stale L1 global data,
8113 nor see stale L2 MTYPE
8115 MTYPE RW and CC memory will
8116 never be stale in L2 due to
8119 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
8122 - If TgSplit execution mode,
8126 - Could be split into
8135 - s_waitcnt vmcnt(0)
8142 - s_waitcnt lgkmcnt(0)
8149 - Must happen before
8161 3. s_waitcnt vmcnt(0) &
8164 - If TgSplit execution mode,
8168 - Must happen before
8177 4. buffer_wbinvl1_vol
8179 - Must happen before
8189 atomicrmw acq_rel - system - generic 1. buffer_wbl2
8191 - Must happen before
8192 following s_waitcnt.
8193 - Performs L2 writeback to
8197 visible at system scope.
8199 2. s_waitcnt lgkmcnt(0) &
8202 - If TgSplit execution mode,
8206 - Could be split into
8215 - s_waitcnt vmcnt(0)
8222 - s_waitcnt lgkmcnt(0)
8229 - Must happen before
8234 to global and L2 writeback
8235 have completed before
8241 4. s_waitcnt vmcnt(0) &
8244 - If TgSplit execution mode,
8248 - Must happen before
8249 following buffer_invl2 and
8260 - Must happen before
8268 stale L1 global data,
8269 nor see stale L2 MTYPE
8271 MTYPE RW and CC memory will
8272 never be stale in L2 due to
8275 fence acq_rel - singlethread *none* *none*
8277 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
8279 - Use lgkmcnt(0) if not
8280 TgSplit execution mode
8281 and vmcnt(0) if TgSplit
8300 - s_waitcnt vmcnt(0)
8305 load atomic/store atomic/
8307 - s_waitcnt lgkmcnt(0)
8314 - Must happen before
8337 acquire-fence-paired-atomic)
8358 release-fence-paired-atomic).
8362 - Must happen before
8366 acquire-fence-paired
8367 atomic has completed
8376 acquire-fence-paired-atomic.
8378 2. buffer_wbinvl1_vol
8380 - If not TgSplit execution
8387 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
8390 - If TgSplit execution mode,
8396 - However, since LLVM
8404 - Could be split into
8413 - s_waitcnt vmcnt(0)
8420 - s_waitcnt lgkmcnt(0)
8427 - Must happen before
8432 global/local/generic
8441 acquire-fence-paired-atomic)
8453 global/local/generic
8462 release-fence-paired-atomic).
8467 2. buffer_wbinvl1_vol
8469 - Must happen before
8483 fence acq_rel - system *none* 1. buffer_wbl2
8488 - Must happen before
8489 following s_waitcnt.
8490 - Performs L2 writeback to
8494 visible at system scope.
8496 2. s_waitcnt lgkmcnt(0) &
8499 - If TgSplit execution mode,
8505 - However, since LLVM
8513 - Could be split into
8522 - s_waitcnt vmcnt(0)
8529 - s_waitcnt lgkmcnt(0)
8536 - Must happen before
8537 the following buffer_invl2 and
8541 global/local/generic
8550 acquire-fence-paired-atomic)
8562 global/local/generic
8571 release-fence-paired-atomic).
8579 - Must happen before
8588 stale L1 global data,
8589 nor see stale L2 MTYPE
8591 MTYPE RW and CC memory will
8592 never be stale in L2 due to
8595 **Sequentially Consistent Atomic**
8596 ------------------------------------------------------------------------------------
8597 load atomic seq_cst - singlethread - global *Same as corresponding
8598 - wavefront - local load atomic acquire,
8599 - generic except must generate
8600 all instructions even
8602 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
8604 - Use lgkmcnt(0) if not
8605 TgSplit execution mode
8606 and vmcnt(0) if TgSplit
8608 - s_waitcnt lgkmcnt(0) must
8621 lgkmcnt(0) and so do
8624 - s_waitcnt vmcnt(0)
8643 consistent global/local
8669 order. The s_waitcnt
8670 could be placed after
8674 make the s_waitcnt be
8681 instructions same as
8684 except must generate
8685 all instructions even
8687 load atomic seq_cst - workgroup - local *If TgSplit execution mode,
8688 local address space cannot
8691 *Same as corresponding
8692 load atomic acquire,
8693 except must generate
8694 all instructions even
8697 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
8698 - system - generic vmcnt(0)
8700 - If TgSplit execution mode,
8702 - Could be split into
8711 - s_waitcnt lgkmcnt(0)
8724 lgkmcnt(0) and so do
8727 - s_waitcnt vmcnt(0)
8772 order. The s_waitcnt
8773 could be placed after
8777 make the s_waitcnt be
8784 instructions same as
8787 except must generate
8788 all instructions even
8790 store atomic seq_cst - singlethread - global *Same as corresponding
8791 - wavefront - local store atomic release,
8792 - workgroup - generic except must generate
8793 - agent all instructions even
8794 - system for OpenCL.*
8795 atomicrmw seq_cst - singlethread - global *Same as corresponding
8796 - wavefront - local atomicrmw acq_rel,
8797 - workgroup - generic except must generate
8798 - agent all instructions even
8799 - system for OpenCL.*
8800 fence seq_cst - singlethread *none* *Same as corresponding
8801 - wavefront fence acq_rel,
8802 - workgroup except must generate
8803 - agent all instructions even
8804 - system for OpenCL.*
8805 ============ ============ ============== ========== ================================
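
The ``store atomic release - workgroup - global`` row of the table above selects its wait counter from the execution mode and the source language. The following Python sketch restates that selection under stated assumptions: ``workgroup_release_sequence`` is a hypothetical name, the ``tgsplit`` and ``opencl`` flags stand in for the conditions named in the table, and rows for wider scopes are not modeled.

```python
def workgroup_release_sequence(tgsplit, opencl=False):
    """Model the GFX90A 'store atomic release - workgroup - global' row.

    Returns the list of machine-code steps as strings.
    """
    if tgsplit:
        # Wavefronts may be on different CUs, so wait on vector memory.
        wait = "s_waitcnt vmcnt(0)"
    elif opencl:
        # The table says to omit lgkmcnt(0) for OpenCL.
        wait = None
    else:
        wait = "s_waitcnt lgkmcnt(0)"
    steps = [wait] if wait else []
    steps.append("buffer/global/flat_store")
    return steps
```

For example, ``workgroup_release_sequence(tgsplit=False)`` yields the ``s_waitcnt lgkmcnt(0)`` step followed by the store.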
8807 .. _amdgpu-amdhsa-memory-model-gfx940:
8814 * Each agent has multiple shader arrays (SA).
8815 * Each SA has multiple compute units (CU).
8816 * Each CU has multiple SIMDs that execute wavefronts.
8817 * The wavefronts for a single work-group are executed in the same CU but may be
8818 executed by different SIMDs. The exception is tgsplit execution mode, in which
8819 the wavefronts may be executed by different SIMDs in different CUs.
8820 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
8821 executing on it. The exception is tgsplit execution mode, in which no LDS is
8822 allocated because wavefronts of the same work-group can be in different CUs.
8823 * All LDS operations of a CU are performed as wavefront wide operations in a
8824 global order and involve no caching. Completion is reported to a wavefront in
8826 * The LDS memory has multiple request queues shared by the SIMDs of a
8827 CU. Therefore, the LDS operations performed by different wavefronts of a
8828 work-group can be reordered relative to each other, which can result in
8829 reordering the visibility of vector memory operations with respect to LDS
8830 operations of other wavefronts in the same work-group. A ``s_waitcnt
8831 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
8832 vector memory operations between wavefronts of a work-group, but not between
8833 operations performed by the same wavefront.
8834 * The vector memory operations are performed as wavefront wide operations and
8835 completion is reported to a wavefront in execution order. The exception is
8836 that ``flat_load/store/atomic`` instructions can report out of vector memory
8837 order if they access LDS memory, and out of LDS operation order if they access
8839 * The vector memory operations access a single vector L1 cache shared by all
8840 SIMDs of a CU. Therefore:
8842 * No special action is required for coherence between the lanes of a single
8845 * No special action is required for coherence between wavefronts in the same
8846 work-group since they execute on the same CU. The exception is tgsplit
8847 execution mode, in which wavefronts of the same work-group can be in
8848 different CUs, so a ``buffer_inv sc0`` is required, which will invalidate
8851 * A ``buffer_inv sc0`` is required to invalidate the L1 cache for coherence
8852 between wavefronts executing in different work-groups as they may be
8853 executing on different CUs.
8855 * Atomic read-modify-write instructions implicitly bypass the L1 cache.
8856 Therefore, they do not use the sc0 bit for coherence and instead use it to
8857 indicate if the instruction returns the original value being updated. They
8858 do use sc1 to indicate system or agent scope coherence.
8860 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
8861 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
8862 scalar operations are used in a restricted way so do not impact the memory
8863 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
8864 * The vector and scalar memory operations use an L2 cache.
8866 * The gfx940 can be configured as a number of smaller agents with each having
8867 a single L2 shared by all CUs on the same agent, or as fewer (possibly one)
8868 larger agents with groups of CUs on each agent each sharing separate L2
8870 * The L2 cache has independent channels to service disjoint ranges of virtual
8872 * Each CU has a separate request queue per channel for its associated L2.
8873 Therefore, the vector and scalar memory operations performed by wavefronts
8874 executing with different L1 caches and the same L2 cache can be reordered
8875 relative to each other.
8876 * A ``s_waitcnt vmcnt(0)`` is required to ensure synchronization between
8877 vector memory operations of different CUs. It ensures a previous vector
8878 memory operation has completed before executing a subsequent vector memory
8879 or LDS operation and so can be used to meet the requirements of acquire and
8881 * An L2 cache can be kept coherent with other L2 caches by using the MTYPE RW
8882 (read-write) for memory local to the L2, and MTYPE NC (non-coherent) with
8883 the PTE C-bit set for memory not local to the L2.
8885 * Any local memory cache lines will be automatically invalidated by writes
8886 from CUs associated with other L2 caches, or writes from the CPU, due to
8887 the cache probe caused by the PTE C-bit.
8888 * XGMI accesses from the CPU to local memory may be cached on the CPU.
8889 Subsequent access from the GPU will automatically invalidate or writeback
8890 the CPU cache due to the L2 probe filter.
8891 * To ensure coherence of local memory writes of CUs with different L1 caches
8892 in the same agent a ``buffer_wbl2`` is required. It does nothing if the
8893 agent is configured to have a single L2, and writes back dirty L2 cache
8894 lines if the agent is configured to have multiple L2 caches.
8895 * To ensure coherence of local memory writes of CUs in different agents a
8896 ``buffer_wbl2 sc1`` is required. It will write back dirty L2 cache lines.
8897 * To ensure coherence of local memory reads of CUs with different L1 caches
8898 in the same agent a ``buffer_inv sc1`` is required. It does nothing if the
8899 agent is configured to have a single L2, and invalidates non-local L2
8900 cache lines if the agent is configured to have multiple L2 caches.
8901 * To ensure coherence of local memory reads of CUs in different agents a
8902 ``buffer_inv sc0 sc1`` is required. It will invalidate non-local L2 cache
8903 lines if configured to have multiple L2 caches.
8905 * PCIe access from the GPU to the CPU can be kept coherent by using the MTYPE
8906 UC (uncached), which bypasses the L2.
8908 Scalar memory operations are only used to access memory that is proven to not
8909 change during the execution of the kernel dispatch. This includes constant
8910 address space and global address space for program scope ``const`` variables.
8911 Therefore, the kernel machine code does not have to maintain the scalar cache to
8912 ensure it is coherent with the vector caches. The scalar and vector caches are
8913 invalidated between kernel dispatches by CP since constant address space data
8914 may change between kernel dispatch executions. See
8915 :ref:`amdgpu-amdhsa-memory-spaces`.
8917 The one exception is if scalar writes are used to spill SGPR registers. In this
8918 case the AMDGPU backend ensures the memory location used to spill is never
8919 accessed by vector memory operations at the same time. If scalar writes are used
8920 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
8921 return since the locations may be used for vector memory instructions by a
8922 future wavefront that uses the same scratch area, or a function call that
8923 creates a frame at the same address, respectively. There is no need for a
8924 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
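
The placement rule for ``s_dcache_wb`` described above can be sketched as a
small pass over a list of instruction mnemonics. This is only an illustration
and not the AMDGPU backend's implementation: the function names are
hypothetical, scalar writes are assumed to be recognizable by an ``s_store``
mnemonic prefix, and ``func_return`` is a placeholder for whatever instruction
performs a function return.

```python
from typing import List

# Hypothetical placeholder for a function-return instruction; the text above
# only says "function return" and does not name a mnemonic.
FUNC_RETURN = "func_return"


def is_scalar_write(inst: str) -> bool:
    # Assumption for this sketch: scalar writes use an "s_store" prefix.
    return inst.startswith("s_store")


def is_exit(inst: str) -> bool:
    # The two points before which the writeback must be placed.
    return inst in ("s_endpgm", FUNC_RETURN)


def insert_scalar_writeback(insts: List[str]) -> List[str]:
    """Return a copy of insts with s_dcache_wb before every exit point.

    Nothing is inserted when the sequence contains no scalar writes.
    """
    if not any(is_scalar_write(i) for i in insts):
        return list(insts)
    out: List[str] = []
    for inst in insts:
        if is_exit(inst):
            out.append("s_dcache_wb")
        out.append(inst)
    return out


kernel = ["s_store_dword", "s_load_dword", "s_endpgm"]
print(insert_scalar_writeback(kernel))
```

The sketch never emits ``s_dcache_inv``, in line with the statement above that
all scalar writes are write-before-read in the same thread.
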
8926 For kernarg backing memory:
8928 * CP invalidates the L1 cache at the start of each kernel dispatch.
8929 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
8930 memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
8931 cache. This also causes it to be treated as non-volatile, so it is not
8932 invalidated by ``*_vol``.
8933 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
8934 so the L2 cache will be coherent with the CPU and other agents.
8936 Scratch backing memory (which is used for the private address space) is accessed
8937 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
8938 only accessed by a single thread, and is always write-before-read, there is
8939 never a need to invalidate these entries from the L1 cache. Hence all cache
8940 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
8942 The code sequences used to implement the memory model for GFX940 are defined
8943 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx940-table`.
8945 .. table:: AMDHSA Memory Model Code Sequences GFX940
8946 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx940-table
8948 ============ ============ ============== ========== ================================
8949 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
8950 Ordering Sync Scope Address GFX940
8952 ============ ============ ============== ========== ================================
8954 ------------------------------------------------------------------------------------
8955 load *none* *none* - global - !volatile & !nontemporal
8957 - private 1. buffer/global/flat_load
8959 - !volatile & nontemporal
8961 1. buffer/global/flat_load
8966 1. buffer/global/flat_load
8968 2. s_waitcnt vmcnt(0)
8970 - Must happen before
8971 any following volatile
8982 load *none* *none* - local 1. ds_load
8983 store *none* *none* - global - !volatile & !nontemporal
8985 - private 1. buffer/global/flat_store
8987 - !volatile & nontemporal
8989 1. buffer/global/flat_store
8994 1. buffer/global/flat_store
8996 2. s_waitcnt vmcnt(0)
8998 - Must happen before
8999 any following volatile
9010 store *none* *none* - local 1. ds_store
9011 **Unordered Atomic**
9012 ------------------------------------------------------------------------------------
9013 load atomic unordered *any* *any* *Same as non-atomic*.
9014 store atomic unordered *any* *any* *Same as non-atomic*.
9015 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
9016 **Monotonic Atomic**
9017 ------------------------------------------------------------------------------------
9018 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
9019 - wavefront - generic
9020 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
9022 load atomic monotonic - singlethread - local *If TgSplit execution mode,
9023 - wavefront local address space cannot
9024 - workgroup be used.*
9027 load atomic monotonic - agent - global 1. buffer/global/flat_load
9029 load atomic monotonic - system - global 1. buffer/global/flat_load
9030 - generic sc0=1 sc1=1
9031 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
9032 - wavefront - generic
9033 store atomic monotonic - workgroup - global 1. buffer/global/flat_store
9035 store atomic monotonic - agent - global 1. buffer/global/flat_store
9037 store atomic monotonic - system - global 1. buffer/global/flat_store
9038 - generic sc0=1 sc1=1
9039 store atomic monotonic - singlethread - local *If TgSplit execution mode,
9040 - wavefront local address space cannot
9041 - workgroup be used.*
9044 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
9045 - wavefront - generic
9048 atomicrmw monotonic - system - global 1. buffer/global/flat_atomic
9050 atomicrmw monotonic - singlethread - local *If TgSplit execution mode,
9051 - wavefront local address space cannot
9052 - workgroup be used.*
9056 ------------------------------------------------------------------------------------
9057 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
9060 load atomic acquire - workgroup - global 1. buffer/global_load sc0=1
9061 2. s_waitcnt vmcnt(0)
9063 - If not TgSplit execution
9065 - Must happen before the
9066 following buffer_inv.
9070 - If not TgSplit execution
9072 - Must happen before
9083 load atomic acquire - workgroup - local *If TgSplit execution mode,
9084 local address space cannot
9088 2. s_waitcnt lgkmcnt(0)
9091 - Must happen before
9100 older than the local load
9104 load atomic acquire - workgroup - generic 1. flat_load sc0=1
9105 2. s_waitcnt lgkm/vmcnt(0)
9107 - Use lgkmcnt(0) if not
9108 TgSplit execution mode
9109 and vmcnt(0) if TgSplit
9111 - If OpenCL, omit lgkmcnt(0).
9112 - Must happen before
9115 following global/generic
9122 older than a local load
9128 - If not TgSplit execution
9135 load atomic acquire - agent - global 1. buffer/global_load
9137 2. s_waitcnt vmcnt(0)
9139 - Must happen before
9149 - Must happen before
9159 load atomic acquire - system - global 1. buffer/global/flat_load
9161 2. s_waitcnt vmcnt(0)
9163 - Must happen before
9171 3. buffer_inv sc0=1 sc1=1
9173 - Must happen before
9181 stale MTYPE NC global data.
9182 MTYPE RW and CC memory will
9183 never be stale due to the
9186 load atomic acquire - agent - generic 1. flat_load sc1=1
9187 2. s_waitcnt vmcnt(0) &
9190 - If TgSplit execution mode,
9194 - Must happen before
9197 - Ensures the flat_load
9204 - Must happen before
9214 load atomic acquire - system - generic 1. flat_load sc0=1 sc1=1
9215 2. s_waitcnt vmcnt(0) &
9218 - If TgSplit execution mode,
9222 - Must happen before
9225 - Ensures the flat_load
9230 3. buffer_inv sc0=1 sc1=1
9232 - Must happen before
9240 stale MTYPE NC global data.
9241 MTYPE RW and CC memory will
9242 never be stale due to the
9245 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic
9246 - wavefront - generic
9247 atomicrmw acquire - singlethread - local *If TgSplit execution mode,
9248 - wavefront local address space cannot
9252 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
9253 2. s_waitcnt vmcnt(0)
9255 - If not TgSplit execution
9257 - Must happen before the
9258 following buffer_inv.
9259 - Ensures the atomicrmw
9266 - If not TgSplit execution
9268 - Must happen before
9278 atomicrmw acquire - workgroup - local *If TgSplit execution mode,
9279 local address space cannot
9283 2. s_waitcnt lgkmcnt(0)
9286 - Must happen before
9295 older than the local
9299 atomicrmw acquire - workgroup - generic 1. flat_atomic
9300 2. s_waitcnt lgkm/vmcnt(0)
9302 - Use lgkmcnt(0) if not
9303 TgSplit execution mode
9304 and vmcnt(0) if TgSplit
9306 - If OpenCL, omit lgkmcnt(0).
9307 - Must happen before
9324 - If not TgSplit execution
9331 atomicrmw acquire - agent - global 1. buffer/global_atomic
9332 2. s_waitcnt vmcnt(0)
9334 - Must happen before
9345 - Must happen before
9355 atomicrmw acquire - system - global 1. buffer/global_atomic
9357 2. s_waitcnt vmcnt(0)
9359 - Must happen before
9368 3. buffer_inv sc0=1 sc1=1
9370 - Must happen before
9378 stale MTYPE NC global data.
9379 MTYPE RW and CC memory will
9380 never be stale due to the
9383 atomicrmw acquire - agent - generic 1. flat_atomic
9384 2. s_waitcnt vmcnt(0) &
9387 - If TgSplit execution mode,
9391 - Must happen before
9402 - Must happen before
9412 atomicrmw acquire - system - generic 1. flat_atomic sc1=1
9413 2. s_waitcnt vmcnt(0) &
9416 - If TgSplit execution mode,
9420 - Must happen before
9429 3. buffer_inv sc0=1 sc1=1
9431 - Must happen before
9439 stale MTYPE NC global data.
9440 MTYPE RW and CC memory will
9441 never be stale due to the
9444 fence acquire - singlethread *none* *none*
9446 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
9448 - Use lgkmcnt(0) if not
9449 TgSplit execution mode
9450 and vmcnt(0) if TgSplit
9460 - However, since LLVM
9475 - s_waitcnt vmcnt(0)
9487 fence-paired-atomic).
9488 - s_waitcnt lgkmcnt(0)
9499 fence-paired-atomic).
9500 - Must happen before
9513 fence-paired-atomic.
9517 - If not TgSplit execution
9524 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
9527 - If TgSplit execution mode,
9533 - However, since LLVM
9541 - Could be split into
9550 - s_waitcnt vmcnt(0)
9561 fence-paired-atomic).
9562 - s_waitcnt lgkmcnt(0)
9573 fence-paired-atomic).
9574 - Must happen before
9588 fence-paired-atomic.
9592 - Must happen before any
9593 following global/generic
9602 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) &
9605 - If TgSplit execution mode,
9611 - However, since LLVM
9619 - Could be split into
9628 - s_waitcnt vmcnt(0)
9639 fence-paired-atomic).
9640 - s_waitcnt lgkmcnt(0)
9651 fence-paired-atomic).
9652 - Must happen before
9666 fence-paired-atomic.
9668 2. buffer_inv sc0=1 sc1=1
9670 - Must happen before any
9671 following global/generic
9681 ------------------------------------------------------------------------------------
9682 store atomic release - singlethread - global 1. buffer/global/flat_store
9683 - wavefront - generic
9684 store atomic release - singlethread - local *If TgSplit execution mode,
9685 - wavefront local address space cannot
9689 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
9691 - Use lgkmcnt(0) if not
9692 TgSplit execution mode
9693 and vmcnt(0) if TgSplit
9695 - If OpenCL, omit lgkmcnt(0).
9696 - s_waitcnt vmcnt(0)
9699 global/generic load/store/
9700 load atomic/store atomic/
9702 - s_waitcnt lgkmcnt(0)
9709 - Must happen before
9720 2. buffer/global/flat_store sc0=1
9721 store atomic release - workgroup - local *If TgSplit execution mode,
9722 local address space cannot
9726 store atomic release - agent - global 1. buffer_wbl2 sc1=1
9728 - Must happen before
9729 following s_waitcnt.
9730 - Performs L2 writeback to
9734 visible at agent scope.
9736 2. s_waitcnt lgkmcnt(0) &
9739 - If TgSplit execution mode,
9745 - Could be split into
9754 - s_waitcnt vmcnt(0)
9761 - s_waitcnt lgkmcnt(0)
9768 - Must happen before
9779 3. buffer/global/flat_store sc1=1
9780 store atomic release - system - global 1. buffer_wbl2 sc0=1 sc1=1
9782 - Must happen before
9783 following s_waitcnt.
9784 - Performs L2 writeback to
9788 visible at system scope.
9790 2. s_waitcnt lgkmcnt(0) &
9793 - If TgSplit execution mode,
9799 - Could be split into
9808 - s_waitcnt vmcnt(0)
9809 must happen after any
9815 - s_waitcnt lgkmcnt(0)
9816 must happen after any
9822 - Must happen before
9827 to memory and the L2
9834 3. buffer/global/flat_store
9836 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic
9837 - wavefront - generic
9838 atomicrmw release - singlethread - local *If TgSplit execution mode,
9839 - wavefront local address space cannot
9843 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
9845 - Use lgkmcnt(0) if not
9846 TgSplit execution mode
9847 and vmcnt(0) if TgSplit
9851 - s_waitcnt vmcnt(0)
9854 global/generic load/store/
9855 load atomic/store atomic/
9857 - s_waitcnt lgkmcnt(0)
9864 - Must happen before
9875 2. buffer/global/flat_atomic sc0=1
9876 atomicrmw release - workgroup - local *If TgSplit execution mode,
9877 local address space cannot
9881 atomicrmw release - agent - global 1. buffer_wbl2 sc1=1
9883 - Must happen before
9884 following s_waitcnt.
9885 - Performs L2 writeback to
9889 visible at agent scope.
9891 2. s_waitcnt lgkmcnt(0) &
9894 - If TgSplit execution mode,
9898 - Could be split into
9907 - s_waitcnt vmcnt(0)
9914 - s_waitcnt lgkmcnt(0)
9921 - Must happen before
9932 3. buffer/global/flat_atomic sc1=1
9933 atomicrmw release - system - global 1. buffer_wbl2 sc0=1 sc1=1
9935 - Must happen before
9936 following s_waitcnt.
9937 - Performs L2 writeback to
9941 visible at system scope.
9943 2. s_waitcnt lgkmcnt(0) &
9946 - If TgSplit execution mode,
9950 - Could be split into
9959 - s_waitcnt vmcnt(0)
9966 - s_waitcnt lgkmcnt(0)
9973 - Must happen before
9978 to memory and the L2
9985 3. buffer/global/flat_atomic
9987 fence release - singlethread *none* *none*
9989 fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
9991 - Use lgkmcnt(0) if not
9992 TgSplit execution mode
9993 and vmcnt(0) if TgSplit
10003 - However, since LLVM
10008 always generate. If
10018 - s_waitcnt vmcnt(0)
10023 load atomic/store atomic/
10025 - s_waitcnt lgkmcnt(0)
10032 - Must happen before
10033 any following store
10037 and memory ordering
10041 fence-paired-atomic).
10048 fence-paired-atomic.
10050 fence release - agent *none* 1. buffer_wbl2 sc1=1
10055 - Must happen before
10056 following s_waitcnt.
10057 - Performs L2 writeback to
10060 store/atomicrmw are
10061 visible at agent scope.
10063 2. s_waitcnt lgkmcnt(0) &
10066 - If TgSplit execution mode,
10076 - However, since LLVM
10081 always generate. If
10091 - Could be split into
10095 lgkmcnt(0) to allow
10097 independently moved
10100 - s_waitcnt vmcnt(0)
10107 - s_waitcnt lgkmcnt(0)
10114 - Must happen before
10115 any following store
10119 and memory ordering
10123 fence-paired-atomic).
10130 fence-paired-atomic.
10132 fence release - system *none* 1. buffer_wbl2 sc0=1 sc1=1
10134 - Must happen before
10135 following s_waitcnt.
10136 - Performs L2 writeback to
10139 store/atomicrmw are
10140 visible at system scope.
10142 2. s_waitcnt lgkmcnt(0) &
10145 - If TgSplit execution mode,
10155 - However, since LLVM
10160 always generate. If
10170 - Could be split into
10174 lgkmcnt(0) to allow
10176 independently moved
10179 - s_waitcnt vmcnt(0)
10186 - s_waitcnt lgkmcnt(0)
10193 - Must happen before
10194 any following store
10198 and memory ordering
10202 fence-paired-atomic).
10209 fence-paired-atomic.
10211 **Acquire-Release Atomic**
10212 ------------------------------------------------------------------------------------
10213 atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic
10214 - wavefront - generic
10215 atomicrmw acq_rel - singlethread - local *If TgSplit execution mode,
10216 - wavefront local address space cannot
10220 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
10222 - Use lgkmcnt(0) if not
10223 TgSplit execution mode
10224 and vmcnt(0) if TgSplit
10228 - Must happen after
10234 - s_waitcnt vmcnt(0)
10237 global/generic load/store/
10238 load atomic/store atomic/
10240 - s_waitcnt lgkmcnt(0)
10247 - Must happen before
10258 2. buffer/global_atomic
10259 3. s_waitcnt vmcnt(0)
10261 - If not TgSplit execution
10263 - Must happen before
10273 4. buffer_inv sc0=1
10275 - If not TgSplit execution
10282 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode,
10283 local address space cannot
10287 2. s_waitcnt lgkmcnt(0)
10290 - Must happen before
10299 older than the local load
10303 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkm/vmcnt(0)
10305 - Use lgkmcnt(0) if not
10306 TgSplit execution mode
10307 and vmcnt(0) if TgSplit
10311 - s_waitcnt vmcnt(0)
10314 global/generic load/store/
10315 load atomic/store atomic/
10317 - s_waitcnt lgkmcnt(0)
10324 - Must happen before
10336 3. s_waitcnt lgkmcnt(0) &
10339 - If not TgSplit execution
10340 mode, omit vmcnt(0).
10343 - Must happen before
10354 older than a local load
10358 4. buffer_inv sc0=1
10360 - If not TgSplit execution
10367 atomicrmw acq_rel - agent - global 1. buffer_wbl2 sc1=1
10369 - Must happen before
10370 following s_waitcnt.
10371 - Performs L2 writeback to
10374 store/atomicrmw are
10375 visible at agent scope.
10377 2. s_waitcnt lgkmcnt(0) &
10380 - If TgSplit execution mode,
10384 - Could be split into
10388 lgkmcnt(0) to allow
10390 independently moved
10393 - s_waitcnt vmcnt(0)
10400 - s_waitcnt lgkmcnt(0)
10407 - Must happen before
10418 3. buffer/global_atomic
10419 4. s_waitcnt vmcnt(0)
10421 - Must happen before
10430 5. buffer_inv sc1=1
10432 - Must happen before
10442 atomicrmw acq_rel - system - global 1. buffer_wbl2 sc0=1 sc1=1
10444 - Must happen before
10445 following s_waitcnt.
10446 - Performs L2 writeback to
10449 store/atomicrmw are
10450 visible at system scope.
10452 2. s_waitcnt lgkmcnt(0) &
10455 - If TgSplit execution mode,
10459 - Could be split into
10463 lgkmcnt(0) to allow
10465 independently moved
10468 - s_waitcnt vmcnt(0)
10475 - s_waitcnt lgkmcnt(0)
10482 - Must happen before
10487 to global and L2 writeback
10488 have completed before
10493 3. buffer/global_atomic
10495 4. s_waitcnt vmcnt(0)
10497 - Must happen before
10506 5. buffer_inv sc0=1 sc1=1
10508 - Must happen before
10516 MTYPE NC global data.
10517 MTYPE RW and CC memory will
10518 never be stale due to the
10521 atomicrmw acq_rel - agent - generic 1. buffer_wbl2 sc1=1
10523 - Must happen before
10524 following s_waitcnt.
10525 - Performs L2 writeback to
10528 store/atomicrmw are
10529 visible at agent scope.
10531 2. s_waitcnt lgkmcnt(0) &
10534 - If TgSplit execution mode,
10538 - Could be split into
10542 lgkmcnt(0) to allow
10544 independently moved
10547 - s_waitcnt vmcnt(0)
10554 - s_waitcnt lgkmcnt(0)
10561 - Must happen before
10573 4. s_waitcnt vmcnt(0) &
10576 - If TgSplit execution mode,
10580 - Must happen before
10589 5. buffer_inv sc1=1
10591 - Must happen before
10601 atomicrmw acq_rel - system - generic 1. buffer_wbl2 sc0=1 sc1=1
10603 - Must happen before
10604 following s_waitcnt.
10605 - Performs L2 writeback to
10608 store/atomicrmw are
10609 visible at system scope.
10611 2. s_waitcnt lgkmcnt(0) &
10614 - If TgSplit execution mode,
10618 - Could be split into
10622 lgkmcnt(0) to allow
10624 independently moved
10627 - s_waitcnt vmcnt(0)
10634 - s_waitcnt lgkmcnt(0)
10641 - Must happen before
10646 to global and L2 writeback
10647 have completed before
10652 3. flat_atomic sc1=1
10653 4. s_waitcnt vmcnt(0) &
10656 - If TgSplit execution mode,
10660 - Must happen before
10669 5. buffer_inv sc0=1 sc1=1
10671 - Must happen before
10679 MTYPE NC global data.
10680 MTYPE RW and CC memory will
10681 never be stale due to the
10684 fence acq_rel - singlethread *none* *none*
10686 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
10688 - Use lgkmcnt(0) if not
10689 TgSplit execution mode
10690 and vmcnt(0) if TgSplit
10709 - s_waitcnt vmcnt(0)
10714 load atomic/store atomic/
10716 - s_waitcnt lgkmcnt(0)
10723 - Must happen before
10742 and memory ordering
10746 acquire-fence-paired-atomic)
10759 local/generic store
10763 and memory ordering
10767 release-fence-paired-atomic).
10771 - Must happen before
10775 acquire-fence-paired
10776 atomic has completed
10777 before invalidating
10781 locations read must
10785 acquire-fence-paired-atomic.
10787 3. buffer_inv sc0=1
10789 - If not TgSplit execution
10796 fence acq_rel - agent *none* 1. buffer_wbl2 sc1=1
10801 - Must happen before
10802 following s_waitcnt.
10803 - Performs L2 writeback to
10806 store/atomicrmw are
10807 visible at agent scope.
10809 2. s_waitcnt lgkmcnt(0) &
10812 - If TgSplit execution mode,
10818 - However, since LLVM
10826 - Could be split into
10830 lgkmcnt(0) to allow
10832 independently moved
10835 - s_waitcnt vmcnt(0)
10842 - s_waitcnt lgkmcnt(0)
10849 - Must happen before
10854 global/local/generic
10859 and memory ordering
10863 acquire-fence-paired-atomic)
10865 before invalidating
10875 global/local/generic
10880 and memory ordering
10884 release-fence-paired-atomic).
10889 3. buffer_inv sc1=1
10891 - Must happen before
10905 fence acq_rel - system *none* 1. buffer_wbl2 sc0=1 sc1=1
10910 - Must happen before
10911 following s_waitcnt.
10912 - Performs L2 writeback to
10915 store/atomicrmw are
10916 visible at system scope.
10918 2. s_waitcnt lgkmcnt(0) &
10921 - If TgSplit execution mode,
10927 - However, since LLVM
10935 - Could be split into
10939 lgkmcnt(0) to allow
10941 independently moved
10944 - s_waitcnt vmcnt(0)
10951 - s_waitcnt lgkmcnt(0)
10958 - Must happen before
10963 global/local/generic
10968 and memory ordering
10972 acquire-fence-paired-atomic)
10974 before invalidating
10984 global/local/generic
10989 and memory ordering
10993 release-fence-paired-atomic).
10998 3. buffer_inv sc0=1 sc1=1
11000 - Must happen before
11009 MTYPE NC global data.
11010 MTYPE RW and CC memory will
11011 never be stale due to the
11014 **Sequential Consistent Atomic**
11015 ------------------------------------------------------------------------------------
11016 load atomic seq_cst - singlethread - global *Same as corresponding
11017 - wavefront - local load atomic acquire,
11018 - generic except must generate
11019 all instructions even
11021 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
11023 - Use lgkmcnt(0) if not
11024 TgSplit execution mode
11025 and vmcnt(0) if TgSplit
11027 - s_waitcnt lgkmcnt(0) must
11034 ordering of seq_cst
11040 lgkmcnt(0) and so do
11043 - s_waitcnt vmcnt(0)
11046 global/generic load
11050 ordering of seq_cst
11062 consistent global/local
11063 memory instructions
11069 prevents reordering
11072 seq_cst load. (Note
11078 followed by a store
11085 release followed by
11088 order. The s_waitcnt
11089 could be placed after
11090 seq_store or before
11093 make the s_waitcnt be
11094 as late as possible
11100 instructions same as
11103 except must generate
11104 all instructions even
11106 load atomic seq_cst - workgroup - local *If TgSplit execution mode,
11107 local address space cannot
11110 *Same as corresponding
11111 load atomic acquire,
11112 except must generate
11113 all instructions even
11116 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
11117 - system - generic vmcnt(0)
11119 - If TgSplit execution mode,
11121 - Could be split into
11125 lgkmcnt(0) to allow
11127 independently moved
11130 - s_waitcnt lgkmcnt(0)
11133 global/generic load
11137 ordering of seq_cst
11143 lgkmcnt(0) and so do
11146 - s_waitcnt vmcnt(0)
11149 global/generic load
11153 ordering of seq_cst
11166 memory instructions
11172 prevents reordering
11175 seq_cst load. (Note
11181 followed by a store
11188 release followed by
11191 order. The s_waitcnt
11192 could be placed after
11193 seq_store or before
11196 make the s_waitcnt be
11197 as late as possible
11203 instructions same as
11206 except must generate
11207 all instructions even
11209 store atomic seq_cst - singlethread - global *Same as corresponding
11210 - wavefront - local store atomic release,
11211 - workgroup - generic except must generate
11212 - agent all instructions even
11213 - system for OpenCL.*
11214 atomicrmw seq_cst - singlethread - global *Same as corresponding
11215 - wavefront - local atomicrmw acq_rel,
11216 - workgroup - generic except must generate
11217 - agent all instructions even
11218 - system for OpenCL.*
11219 fence seq_cst - singlethread *none* *Same as corresponding
11220 - wavefront fence acq_rel,
11221 - workgroup except must generate
11222 - agent all instructions even
11223 - system for OpenCL.*
11224 ============ ============ ============== ========== ================================
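
The way the GFX940 rows above vary with sync scope can be summarized in a short
lookup sketch. It is a reading aid under stated assumptions, not the code the
backend uses: the function and type names are hypothetical, and the mapping is
taken only from the ``sc0``/``sc1`` annotations visible on the atomic load,
``buffer_inv`` and ``buffer_wbl2`` entries of the table.

```python
from typing import NamedTuple, Optional

# Sync scopes in order of increasing reach, as named in the table.
SCOPES = ("singlethread", "wavefront", "workgroup", "agent", "system")


class ScBits(NamedTuple):
    sc0: bool
    sc1: bool


def _check(scope: str) -> None:
    if scope not in SCOPES:
        raise ValueError(f"unknown sync scope: {scope}")


def load_sc_bits(scope: str) -> ScBits:
    """sc0/sc1 bits shown on an atomic load for the given sync scope."""
    _check(scope)
    return {
        "singlethread": ScBits(False, False),
        "wavefront": ScBits(False, False),
        "workgroup": ScBits(True, False),
        "agent": ScBits(False, True),
        "system": ScBits(True, True),
    }[scope]


def acquire_invalidate(scope: str, tgsplit: bool) -> Optional[str]:
    """Cache invalidate that follows an acquire, or None if none is needed."""
    _check(scope)
    if scope == "workgroup":
        # Only needed in tgsplit mode, where a work-group may span CUs.
        return "buffer_inv sc0=1" if tgsplit else None
    if scope == "agent":
        return "buffer_inv sc1=1"
    if scope == "system":
        return "buffer_inv sc0=1 sc1=1"
    return None


def release_writeback(scope: str) -> Optional[str]:
    """L2 writeback that precedes a release, or None if none is needed."""
    _check(scope)
    if scope == "agent":
        return "buffer_wbl2 sc1=1"
    if scope == "system":
        return "buffer_wbl2 sc0=1 sc1=1"
    return None


for s in SCOPES:
    print(s, load_sc_bits(s), acquire_invalidate(s, False), release_writeback(s))
```

The ``s_waitcnt`` instructions that accompany each sequence are left out of the
sketch; the table remains the authoritative description of the full sequences.
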
11226 .. _amdgpu-amdhsa-memory-model-gfx10-gfx11:
11228 Memory Model GFX10-GFX11
11229 ++++++++++++++++++++++++
11233 * Each agent has multiple shader arrays (SA).
11234 * Each SA has multiple work-group processors (WGP).
11235 * Each WGP has multiple compute units (CU).
11236 * Each CU has multiple SIMDs that execute wavefronts.
11237 * The wavefronts for a single work-group are executed in the same
11238 WGP. In CU wavefront execution mode the wavefronts may be executed by
11239 different SIMDs in the same CU. In WGP wavefront execution mode the
11240 wavefronts may be executed by different SIMDs in different CUs in the same
11242 * Each WGP has a single LDS memory shared by the wavefronts of the work-groups
11244 * All LDS operations of a WGP are performed as wavefront wide operations in a
11245 global order and involve no caching. Completion is reported to a wavefront in
11247 * The LDS memory has multiple request queues shared by the SIMDs of a
11248 WGP. Therefore, the LDS operations performed by different wavefronts of a
11249 work-group can be reordered relative to each other, which can result in
11250 reordering the visibility of vector memory operations with respect to LDS
11251 operations of other wavefronts in the same work-group. A ``s_waitcnt
11252 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
11253 vector memory operations between wavefronts of a work-group, but not between
11254 operations performed by the same wavefront.
11255 * The vector memory operations are performed as wavefront wide operations.
11256 Completion of load/store/sample operations is reported to a wavefront in
11257 execution order of other load/store/sample operations performed by that
11259 * The vector memory operations access a vector L0 cache. There is a single L0
11260 cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
11261 special action is required for coherence between the lanes of a single
11262 wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
11263 wavefronts executing in the same work-group as they may be executing on SIMDs
11264 of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
11265 required for coherence between wavefronts executing in different work-groups
11266 as they may be executing on different WGPs.
11267 * The scalar memory operations access a scalar L0 cache shared by all wavefronts
11268 on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
11269 operations are used in a restricted way so do not impact the memory model. See
11270 :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
  the same SA. Therefore, no special action is required for coherence between
  the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different SAs that access different L1s.
* The L1 caches have independent quadrants to service disjoint ranges of virtual
  addresses.
* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
  vector and scalar memory operations performed by different wavefronts, whether
  executing in the same or different work-groups (which may be executing on
  different CUs accessing different L0s), can be reordered relative to each
  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
  synchronization between vector memory operations of different wavefronts. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire, release and sequential consistency.
* The L1 caches use an L2 cache shared by all SAs on the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
  quadrant has a separate request queue per L2 channel. Therefore, the vector
  and scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different SAs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
  required to ensure synchronization between vector memory operations of
  different SAs. It ensures a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.
* On GFX10.3 and GFX11, a memory attached last level (MALL) cache exists for GPU
  memory. The MALL cache is fully coherent with GPU memory and has no impact on
  system coherence. All agents (GPU and CPU) access GPU memory through the MALL
  cache.
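
The coherence rules in the list above amount to asking which cache level two wavefronts first share. The following Python sketch is only an illustrative model of those rules, not backend code: the function name and its boolean parameters are hypothetical, and it covers only the vector L0 and L1 invalidates named above.

```python
def invalidates_for_coherence(same_cu: bool, same_sa: bool) -> list[str]:
    """Return the invalidates a reader needs to see another wavefront's writes.

    Hypothetical model of the list above:
    - wavefronts on the same CU share one vector L0, so nothing is needed;
    - wavefronts on different CUs of the same SA share the L1, so only the
      L0 has to be invalidated;
    - wavefronts on different SAs share only the L2, so both the L0 and the
      L1 have to be invalidated.
    """
    if same_cu and not same_sa:
        # A CU belongs to exactly one SA, so this combination cannot occur.
        raise ValueError("wavefronts on the same CU are always on the same SA")
    if same_cu:
        return []
    if same_sa:
        return ["buffer_gl0_inv"]
    return ["buffer_gl0_inv", "buffer_gl1_inv"]
```

For example, two wavefronts of one work-group running on different CUs of a WGP fall into the second case and need only ``buffer_gl0_inv``.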

Scalar memory operations are only used to access memory that is proven not to
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPRs. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

For kernarg backing memory:

* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
  needing to invalidate the L2 cache.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC (non-coherent). Since the private address space is only accessed
by a single thread, and is always write-before-read, there is never a need to
invalidate these entries from the L0 or L1 caches.

Wavefronts are executed in native mode with in-order reporting of loads and
sample instructions. In this mode ``vmcnt`` reports completion of load, atomic
with return and sample instructions in order, and ``vscnt`` reports the
completion of store and atomic without return instructions in order. See the
``MEM_ORDERED`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
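
As a rough illustration of how the counters partition the operations, the sketch below maps each kind of outstanding operation to the counter that tracks it and builds a wait in the ``s_waitcnt ... & ...`` notation used by this section. The operation-kind names and the helper are hypothetical. The ``lgkmcnt`` entry for LDS follows the LDS discussion earlier in this section rather than the paragraph above.

```python
# Hypothetical operation-kind names; only the counter names come from the text.
COUNTER_FOR_OPERATION = {
    "load": "vmcnt",
    "sample": "vmcnt",
    "atomic_with_return": "vmcnt",
    "store": "vscnt",
    "atomic_without_return": "vscnt",
    "lds": "lgkmcnt",
}


def waitcnt_for(pending_ops):
    """Build the wait needed before relying on completion of pending_ops.

    Returns an empty string when nothing is outstanding. The result uses the
    descriptive notation of this section, not exact assembler syntax.
    """
    counters = sorted({COUNTER_FOR_OPERATION[op] for op in pending_ops})
    if not counters:
        return ""
    return "s_waitcnt " + " & ".join(f"{name}(0)" for name in counters)
```

A load followed by a store therefore needs both ``vmcnt(0)`` and ``vscnt(0)``, because each counter only reports completion of its own class of operations.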

Wavefronts can be executed in WGP or CU wavefront execution mode:

* In WGP wavefront execution mode the wavefronts of a work-group are executed
  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
  CU L0 caches is required for work-group synchronization. Also, accesses to L1
  at work-group scope need to be explicitly ordered as the accesses from
  different CUs are not ordered.
* In CU wavefront execution mode the wavefronts of a work-group are executed on
  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
  the work-group use the same L0, which in turn ensures L1 accesses are
  ordered, so explicit management of the caches is not required for
  work-group synchronization.

See the ``WGP_MODE`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table` and
:ref:`amdgpu-target-features`.

The code sequences used to implement the memory model for GFX10-GFX11 are
defined in table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.

11362 .. table:: AMDHSA Memory Model Code Sequences GFX10-GFX11
11363 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table
11365 ============ ============ ============== ========== ================================
11366 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
11367 Ordering Sync Scope Address GFX10-GFX11
11369 ============ ============ ============== ========== ================================
11371 ------------------------------------------------------------------------------------
11372 load *none* *none* - global - !volatile & !nontemporal
11374 - private 1. buffer/global/flat_load
11376 - !volatile & nontemporal
11378 1. buffer/global/flat_load
11381 - If GFX10, omit dlc=1.
11385 1. buffer/global/flat_load
11388 2. s_waitcnt vmcnt(0)
11390 - Must happen before
11391 any following volatile
11402 load *none* *none* - local 1. ds_load
11403 store *none* *none* - global - !volatile & !nontemporal
11405 - private 1. buffer/global/flat_store
11407 - !volatile & nontemporal
11409 1. buffer/global/flat_store
11412 - If GFX10, omit dlc=1.
11416 1. buffer/global/flat_store
11419 - If GFX10, omit dlc=1.
11421 2. s_waitcnt vscnt(0)
11423 - Must happen before
11424 any following volatile
11435 store *none* *none* - local 1. ds_store
11436 **Unordered Atomic**
11437 ------------------------------------------------------------------------------------
11438 load atomic unordered *any* *any* *Same as non-atomic*.
11439 store atomic unordered *any* *any* *Same as non-atomic*.
11440 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
11441 **Monotonic Atomic**
11442 ------------------------------------------------------------------------------------
11443 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
11444 - wavefront - generic
11445 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
11448 - If CU wavefront execution
11451 load atomic monotonic - singlethread - local 1. ds_load
11454 load atomic monotonic - agent - global 1. buffer/global/flat_load
11455 - system - generic glc=1 dlc=1
11457 - If GFX11, omit dlc=1.
11459 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
11460 - wavefront - generic
11464 store atomic monotonic - singlethread - local 1. ds_store
11467 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
11468 - wavefront - generic
11472 atomicrmw monotonic - singlethread - local 1. ds_atomic
11476 ------------------------------------------------------------------------------------
11477 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
11478 - wavefront - local
11480 load atomic acquire - workgroup - global 1. buffer/global_load glc=1
11482 - If CU wavefront execution
11485 2. s_waitcnt vmcnt(0)
11487 - If CU wavefront execution
11489 - Must happen before
11490 the following buffer_gl0_inv
11491 and before any following
11499 - If CU wavefront execution
11506 load atomic acquire - workgroup - local 1. ds_load
11507 2. s_waitcnt lgkmcnt(0)
11510 - Must happen before
11511 the following buffer_gl0_inv
11512 and before any following
11513 global/generic load/load
11519 older than the local load
11525 - If CU wavefront execution
11533 load atomic acquire - workgroup - generic 1. flat_load glc=1
11535 - If CU wavefront execution
11538 2. s_waitcnt lgkmcnt(0) &
11541 - If CU wavefront execution
11542 mode, omit vmcnt(0).
11545 - Must happen before
11547 buffer_gl0_inv and any
11548 following global/generic
11555 older than a local load
11561 - If CU wavefront execution
11568 load atomic acquire - agent - global 1. buffer/global_load
11569 - system glc=1 dlc=1
11571 - If GFX11, omit dlc=1.
11573 2. s_waitcnt vmcnt(0)
11575 - Must happen before
11580 before invalidating
11586 - Must happen before
11596 load atomic acquire - agent - generic 1. flat_load glc=1 dlc=1
11598 - If GFX11, omit dlc=1.
11600 2. s_waitcnt vmcnt(0) &
11605 - Must happen before
11608 - Ensures the flat_load
11610 before invalidating
11616 - Must happen before
11626 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
11627 - wavefront - local
11629 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
11630 2. s_waitcnt vm/vscnt(0)
11632 - If CU wavefront execution
11634 - Use vmcnt(0) if atomic with
11635 return and vscnt(0) if
11636 atomic with no-return.
11637 - Must happen before
11638 the following buffer_gl0_inv
11639 and before any following
11647 - If CU wavefront execution
11654 atomicrmw acquire - workgroup - local 1. ds_atomic
11655 2. s_waitcnt lgkmcnt(0)
11658 - Must happen before
11664 older than the local
11676 atomicrmw acquire - workgroup - generic 1. flat_atomic
11677 2. s_waitcnt lgkmcnt(0) &
11680 - If CU wavefront execution
11681 mode, omit vm/vscnt(0).
11682 - If OpenCL, omit lgkmcnt(0).
11683 - Use vmcnt(0) if atomic with
11684 return and vscnt(0) if
11685 atomic with no-return.
11686 - Must happen before
11698 - If CU wavefront execution
11705 atomicrmw acquire - agent - global 1. buffer/global_atomic
11706 - system 2. s_waitcnt vm/vscnt(0)
11708 - Use vmcnt(0) if atomic with
11709 return and vscnt(0) if
11710 atomic with no-return.
11711 - Must happen before
11723 - Must happen before
11733 atomicrmw acquire - agent - generic 1. flat_atomic
11734 - system 2. s_waitcnt vm/vscnt(0) &
11739 - Use vmcnt(0) if atomic with
11740 return and vscnt(0) if
11741 atomic with no-return.
11742 - Must happen before
11754 - Must happen before
11764 fence acquire - singlethread *none* *none*
11766 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
11767 vmcnt(0) & vscnt(0)
11769 - If CU wavefront execution
11770 mode, omit vmcnt(0) and
11779 vmcnt(0) and vscnt(0).
11780 - However, since LLVM
11785 always generate. If
11795 - Could be split into
11797 vmcnt(0), s_waitcnt
11798 vscnt(0) and s_waitcnt
11799 lgkmcnt(0) to allow
11801 independently moved
11804 - s_waitcnt vmcnt(0)
11807 global/generic load
11809 atomicrmw-with-return-value
11812 and memory ordering
11816 fence-paired-atomic).
11817 - s_waitcnt vscnt(0)
11821 atomicrmw-no-return-value
11824 and memory ordering
11828 fence-paired-atomic).
11829 - s_waitcnt lgkmcnt(0)
11836 and memory ordering
11840 fence-paired-atomic).
11841 - Must happen before
11845 fence-paired atomic
11847 before invalidating
11851 locations read must
11855 fence-paired-atomic.
11859 - If CU wavefront execution
11866 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
11867 - system vmcnt(0) & vscnt(0)
11876 vmcnt(0) and vscnt(0).
11877 - However, since LLVM
11885 - Could be split into
11887 vmcnt(0), s_waitcnt
11888 vscnt(0) and s_waitcnt
11889 lgkmcnt(0) to allow
11891 independently moved
11894 - s_waitcnt vmcnt(0)
11897 global/generic load
11899 atomicrmw-with-return-value
11902 and memory ordering
11906 fence-paired-atomic).
11907 - s_waitcnt vscnt(0)
11911 atomicrmw-no-return-value
11914 and memory ordering
11918 fence-paired-atomic).
11919 - s_waitcnt lgkmcnt(0)
11926 and memory ordering
11930 fence-paired-atomic).
11931 - Must happen before
11935 fence-paired atomic
11937 before invalidating
11941 locations read must
11945 fence-paired-atomic.
11950 - Must happen before any
11951 following global/generic
11961 ------------------------------------------------------------------------------------
11962 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
11963 - wavefront - local
11965 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
11966 - generic vmcnt(0) & vscnt(0)
11968 - If CU wavefront execution
11969 mode, omit vmcnt(0) and
11973 - Could be split into
11975 vmcnt(0), s_waitcnt
11976 vscnt(0) and s_waitcnt
11977 lgkmcnt(0) to allow
11979 independently moved
11982 - s_waitcnt vmcnt(0)
11985 global/generic load/load
11987 atomicrmw-with-return-value.
11988 - s_waitcnt vscnt(0)
11994 atomicrmw-no-return-value.
11995 - s_waitcnt lgkmcnt(0)
12002 - Must happen before
12010 store that is being
12013 2. buffer/global/flat_store
12014 store atomic release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
12016 - If CU wavefront execution
12019 - Could be split into
12021 vmcnt(0) and s_waitcnt
12024 independently moved
12027 - s_waitcnt vmcnt(0)
12030 global/generic load/load
12032 atomicrmw-with-return-value.
12033 - s_waitcnt vscnt(0)
12037 store/store atomic/
12038 atomicrmw-no-return-value.
12039 - Must happen before
12047 store that is being
12051 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
12052 - system - generic vmcnt(0) & vscnt(0)
12058 - Could be split into
12060 vmcnt(0), s_waitcnt vscnt(0)
12062 lgkmcnt(0) to allow
12064 independently moved
12067 - s_waitcnt vmcnt(0)
12073 atomicrmw-with-return-value.
12074 - s_waitcnt vscnt(0)
12078 store/store atomic/
12079 atomicrmw-no-return-value.
12080 - s_waitcnt lgkmcnt(0)
12087 - Must happen before
12095 store that is being
12098 2. buffer/global/flat_store
12099 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
12100 - wavefront - local
12102 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
12103 - generic vmcnt(0) & vscnt(0)
12105 - If CU wavefront execution
12106 mode, omit vmcnt(0) and
12108 - If OpenCL, omit lgkmcnt(0).
12109 - Could be split into
12111 vmcnt(0), s_waitcnt
12112 vscnt(0) and s_waitcnt
12113 lgkmcnt(0) to allow
12115 independently moved
12118 - s_waitcnt vmcnt(0)
12121 global/generic load/load
12123 atomicrmw-with-return-value.
12124 - s_waitcnt vscnt(0)
12130 atomicrmw-no-return-value.
12131 - s_waitcnt lgkmcnt(0)
12138 - Must happen before
12149 2. buffer/global/flat_atomic
12150 atomicrmw release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
12152 - If CU wavefront execution
12155 - Could be split into
12157 vmcnt(0) and s_waitcnt
12160 independently moved
12163 - s_waitcnt vmcnt(0)
12166 global/generic load/load
12168 atomicrmw-with-return-value.
12169 - s_waitcnt vscnt(0)
12173 store/store atomic/
12174 atomicrmw-no-return-value.
12175 - Must happen before
12183 store that is being
12187 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
12188 - system - generic vmcnt(0) & vscnt(0)
12192 - Could be split into
12194 vmcnt(0), s_waitcnt
12195 vscnt(0) and s_waitcnt
12196 lgkmcnt(0) to allow
12198 independently moved
12201 - s_waitcnt vmcnt(0)
12206 atomicrmw-with-return-value.
12207 - s_waitcnt vscnt(0)
12211 store/store atomic/
12212 atomicrmw-no-return-value.
12213 - s_waitcnt lgkmcnt(0)
12220 - Must happen before
12225 to global and local
12231 2. buffer/global/flat_atomic
12232 fence release - singlethread *none* *none*
12234 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
12235 vmcnt(0) & vscnt(0)
12237 - If CU wavefront execution
12238 mode, omit vmcnt(0) and
12247 vmcnt(0) and vscnt(0).
12248 - However, since LLVM
12253 always generate. If
12263 - Could be split into
12265 vmcnt(0), s_waitcnt
12266 vscnt(0) and s_waitcnt
12267 lgkmcnt(0) to allow
12269 independently moved
12272 - s_waitcnt vmcnt(0)
12278 atomicrmw-with-return-value.
12279 - s_waitcnt vscnt(0)
12283 store/store atomic/
12284 atomicrmw-no-return-value.
12285 - s_waitcnt lgkmcnt(0)
12290 atomic/store atomic/
12292 - Must happen before
12293 any following store
12297 and memory ordering
12301 fence-paired-atomic).
12308 fence-paired-atomic.
12310 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
12311 - system vmcnt(0) & vscnt(0)
12320 vmcnt(0) and vscnt(0).
12321 - However, since LLVM
12326 always generate. If
12336 - Could be split into
12338 vmcnt(0), s_waitcnt
12339 vscnt(0) and s_waitcnt
12340 lgkmcnt(0) to allow
12342 independently moved
12345 - s_waitcnt vmcnt(0)
12350 atomicrmw-with-return-value.
12351 - s_waitcnt vscnt(0)
12355 store/store atomic/
12356 atomicrmw-no-return-value.
12357 - s_waitcnt lgkmcnt(0)
12364 - Must happen before
12365 any following store
12369 and memory ordering
12373 fence-paired-atomic).
12380 fence-paired-atomic.
12382 **Acquire-Release Atomic**
12383 ------------------------------------------------------------------------------------
12384 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
12385 - wavefront - local
12387 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) &
12388 vmcnt(0) & vscnt(0)
12390 - If CU wavefront execution
12391 mode, omit vmcnt(0) and
12395 - Must happen after
12401 - Could be split into
12403 vmcnt(0), s_waitcnt
12404 vscnt(0), and s_waitcnt
12405 lgkmcnt(0) to allow
12407 independently moved
12410 - s_waitcnt vmcnt(0)
12413 global/generic load/load
12415 atomicrmw-with-return-value.
12416 - s_waitcnt vscnt(0)
12422 atomicrmw-no-return-value.
12423 - s_waitcnt lgkmcnt(0)
12430 - Must happen before
12441 2. buffer/global_atomic
12442 3. s_waitcnt vm/vscnt(0)
12444 - If CU wavefront execution
12446 - Use vmcnt(0) if atomic with
12447 return and vscnt(0) if
12448 atomic with no-return.
12449 - Must happen before
12461 - If CU wavefront execution
12468 atomicrmw acq_rel - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
12470 - If CU wavefront execution
12473 - Could be split into
12475 vmcnt(0) and s_waitcnt
12478 independently moved
12481 - s_waitcnt vmcnt(0)
12484 global/generic load/load
12486 atomicrmw-with-return-value.
12487 - s_waitcnt vscnt(0)
12491 store/store atomic/
12492 atomicrmw-no-return-value.
12493 - Must happen before
12501 store that is being
12505 3. s_waitcnt lgkmcnt(0)
12508 - Must happen before
12514 older than the local load
12520 - If CU wavefront execution
12528 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) &
12529 vmcnt(0) & vscnt(0)
12531 - If CU wavefront execution
12532 mode, omit vmcnt(0) and
12534 - If OpenCL, omit lgkmcnt(0).
12535 - Could be split into
12537 vmcnt(0), s_waitcnt
12538 vscnt(0) and s_waitcnt
12539 lgkmcnt(0) to allow
12541 independently moved
12544 - s_waitcnt vmcnt(0)
12547 global/generic load/load
12549 atomicrmw-with-return-value.
12550 - s_waitcnt vscnt(0)
12556 atomicrmw-no-return-value.
12557 - s_waitcnt lgkmcnt(0)
12564 - Must happen before
12576 3. s_waitcnt lgkmcnt(0) &
12577 vmcnt(0) & vscnt(0)
12579 - If CU wavefront execution
12580 mode, omit vmcnt(0) and
12582 - If OpenCL, omit lgkmcnt(0).
12583 - Must happen before
12589 older than the load
12595 - If CU wavefront execution
12602 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
12603 - system vmcnt(0) & vscnt(0)
12607 - Could be split into
12609 vmcnt(0), s_waitcnt
12610 vscnt(0) and s_waitcnt
12611 lgkmcnt(0) to allow
12613 independently moved
12616 - s_waitcnt vmcnt(0)
12621 atomicrmw-with-return-value.
12622 - s_waitcnt vscnt(0)
12626 store/store atomic/
12627 atomicrmw-no-return-value.
12628 - s_waitcnt lgkmcnt(0)
12635 - Must happen before
12646 2. buffer/global_atomic
12647 3. s_waitcnt vm/vscnt(0)
12649 - Use vmcnt(0) if atomic with
12650 return and vscnt(0) if
12651 atomic with no-return.
12652 - Must happen before
12664 - Must happen before
12674 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
12675 - system vmcnt(0) & vscnt(0)
12679 - Could be split into
12681 vmcnt(0), s_waitcnt
12682 vscnt(0), and s_waitcnt
12683 lgkmcnt(0) to allow
12685 independently moved
12688 - s_waitcnt vmcnt(0)
12693 atomicrmw-with-return-value.
12694 - s_waitcnt vscnt(0)
12698 store/store atomic/
12699 atomicrmw-no-return-value.
12700 - s_waitcnt lgkmcnt(0)
12707 - Must happen before
12719 3. s_waitcnt vm/vscnt(0) &
12724 - Use vmcnt(0) if atomic with
12725 return and vscnt(0) if
12726 atomic with no-return.
12727 - Must happen before
12739 - Must happen before
12749 fence acq_rel - singlethread *none* *none*
12751 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
12752 vmcnt(0) & vscnt(0)
12754 - If CU wavefront execution
12755 mode, omit vmcnt(0) and
12764 vmcnt(0) and vscnt(0).
12774 - Could be split into
12776 vmcnt(0), s_waitcnt
12777 vscnt(0) and s_waitcnt
12778 lgkmcnt(0) to allow
12780 independently moved
12783 - s_waitcnt vmcnt(0)
12789 atomicrmw-with-return-value.
12790 - s_waitcnt vscnt(0)
12794 store/store atomic/
12795 atomicrmw-no-return-value.
12796 - s_waitcnt lgkmcnt(0)
12801 atomic/store atomic/
12803 - Must happen before
12822 and memory ordering
12826 acquire-fence-paired-atomic)
12839 local/generic store
12843 and memory ordering
12847 release-fence-paired-atomic).
12851 - Must happen before
12855 acquire-fence-paired
12856 atomic has completed
12857 before invalidating
12861 locations read must
12865 acquire-fence-paired-atomic.
12869 - If CU wavefront execution
12876 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
12877 - system vmcnt(0) & vscnt(0)
12886 vmcnt(0) and vscnt(0).
12887 - However, since LLVM
12895 - Could be split into
12897 vmcnt(0), s_waitcnt
12898 vscnt(0) and s_waitcnt
12899 lgkmcnt(0) to allow
12901 independently moved
12904 - s_waitcnt vmcnt(0)
12910 atomicrmw-with-return-value.
12911 - s_waitcnt vscnt(0)
12915 store/store atomic/
12916 atomicrmw-no-return-value.
12917 - s_waitcnt lgkmcnt(0)
12924 - Must happen before
12929 global/local/generic
12934 and memory ordering
12938 acquire-fence-paired-atomic)
12940 before invalidating
12950 global/local/generic
12955 and memory ordering
12959 release-fence-paired-atomic).
12967 - Must happen before
12981 **Sequential Consistent Atomic**
12982 ------------------------------------------------------------------------------------
12983 load atomic seq_cst - singlethread - global *Same as corresponding
12984 - wavefront - local load atomic acquire,
12985 - generic except must generate
12986 all instructions even
12988 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) &
12989 - generic vmcnt(0) & vscnt(0)
12991 - If CU wavefront execution
12992 mode, omit vmcnt(0) and
12994 - Could be split into
12996 vmcnt(0), s_waitcnt
12997 vscnt(0), and s_waitcnt
12998 lgkmcnt(0) to allow
13000 independently moved
13003 - s_waitcnt lgkmcnt(0) must
13010 ordering of seq_cst
13016 lgkmcnt(0) and so do
13019 - s_waitcnt vmcnt(0)
13022 global/generic load
13024 atomicrmw-with-return-value
13026 ordering of seq_cst
13035 - s_waitcnt vscnt(0)
13038 global/generic store
13040 atomicrmw-no-return-value
13042 ordering of seq_cst
13054 consistent global/local
13055 memory instructions
13061 prevents reordering
13064 seq_cst load. (Note
13070 followed by a store
13077 release followed by
13080 order. The s_waitcnt
13081 could be placed after
13082 seq_store or before
13085 make the s_waitcnt be
13086 as late as possible
13092 instructions same as
13095 except must generate
13096 all instructions even
13098 load atomic seq_cst - workgroup - local
13100 1. s_waitcnt vmcnt(0) & vscnt(0)
13102 - If CU wavefront execution
13104 - Could be split into
13106 vmcnt(0) and s_waitcnt
13109 independently moved
13112 - s_waitcnt vmcnt(0)
13115 global/generic load
13117 atomicrmw-with-return-value
13119 ordering of seq_cst
13128 - s_waitcnt vscnt(0)
13131 global/generic store
13133 atomicrmw-no-return-value
13135 ordering of seq_cst
13148 memory instructions
13154 prevents reordering
13157 seq_cst load. (Note
13163 followed by a store
13170 release followed by
13173 order. The s_waitcnt
13174 could be placed after
13175 seq_store or before
13178 make the s_waitcnt be
13179 as late as possible
13185 instructions same as
13188 except must generate
13189 all instructions even
13192 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
13193 - system - generic vmcnt(0) & vscnt(0)
13195 - Could be split into
13197 vmcnt(0), s_waitcnt
13198 vscnt(0) and s_waitcnt
13199 lgkmcnt(0) to allow
13201 independently moved
13204 - s_waitcnt lgkmcnt(0)
13211 ordering of seq_cst
13217 lgkmcnt(0) and so do
13220 - s_waitcnt vmcnt(0)
13223 global/generic load
13225 atomicrmw-with-return-value
13227 ordering of seq_cst
13236 - s_waitcnt vscnt(0)
13239 global/generic store
13241 atomicrmw-no-return-value
13243 ordering of seq_cst
13256 memory instructions
13262 prevents reordering
13265 seq_cst load. (Note
13271 followed by a store
13278 release followed by
13281 order. The s_waitcnt
13282 could be placed after
13283 seq_store or before
13286 make the s_waitcnt be
13287 as late as possible
13293 instructions same as
13296 except must generate
13297 all instructions even
13299 store atomic seq_cst - singlethread - global *Same as corresponding
13300 - wavefront - local store atomic release,
13301 - workgroup - generic except must generate
13302 - agent all instructions even
13303 - system for OpenCL.*
13304 atomicrmw seq_cst - singlethread - global *Same as corresponding
13305 - wavefront - local atomicrmw acq_rel,
13306 - workgroup - generic except must generate
13307 - agent all instructions even
13308 - system for OpenCL.*
13309 fence seq_cst - singlethread *none* *Same as corresponding
13310 - wavefront fence acq_rel,
13311 - workgroup except must generate
13312 - agent all instructions even
13313 - system for OpenCL.*
13314 ============ ============ ============== ========== ================================
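
To make the scope and execution-mode dependence of the release rows easier to follow, here is a minimal Python sketch of the wait that precedes a ``store atomic release`` to the global address space. It is an illustration under stated assumptions, not the backend's implementation: the function name is hypothetical, the OpenCL-specific omissions listed in the table are ignored, and only the five sync scopes shown in the table are handled.

```python
ALL_COUNTERS = ["lgkmcnt(0)", "vmcnt(0)", "vscnt(0)"]


def release_waitcnt(sync_scope: str, cu_mode: bool) -> str:
    """Return the wait emitted before a release store to global memory.

    Hypothetical helper modelling the table rows:
    - singlethread/wavefront: no wait, just the store;
    - workgroup: all three counters, except that CU wavefront execution
      mode omits vmcnt(0) and vscnt(0);
    - agent/system: all three counters regardless of execution mode.
    """
    if sync_scope in ("singlethread", "wavefront"):
        return ""
    if sync_scope == "workgroup":
        counters = ["lgkmcnt(0)"] if cu_mode else ALL_COUNTERS
    elif sync_scope in ("agent", "system"):
        counters = ALL_COUNTERS
    else:
        raise ValueError(f"unhandled sync scope: {sync_scope}")
    return "s_waitcnt " + " & ".join(counters)
```

The wait is followed by the ``buffer/global/flat_store`` itself. The sketch only shows why the workgroup rows are the ones that mention the execution mode.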

.. _amdgpu-amdhsa-trap-handler-abi:

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
supports the ``s_trap`` instruction. For usage see:

- :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
- :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
- :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table`

.. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
   :name: amdgpu-trap-handler-for-amdhsa-os-v2-table

   =================== =============== =============== =======================================
   Usage               Code Sequence   Trap Handler    Description
                                       Inputs
   =================== =============== =============== =======================================
   reserved            ``s_trap 0x00``                 Reserved by hardware.
   ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for Finalizer HSA ``debugtrap``
                                         ``queue_ptr`` intrinsic (not implemented).
   ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
                                         ``queue_ptr`` the trap instruction. The associated
                                                       queue is signalled to put it into the
                                                       error state. When the queue is put in
                                                       the error state, the waves executing
                                                       dispatches on the queue will be
                                                       terminated.
   ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled,
                                                         behaves as a no-operation. The trap
                                                         handler is entered and immediately
                                                         returns to continue execution of the
                                                         wavefront.
                                                       - If the debugger is enabled, causes
                                                         the debug trap to be reported by the
                                                         debugger and the wavefront is put in
                                                         the halt state with the PC at the
                                                         instruction. The debugger must
                                                         increment the PC and resume the wave.
   reserved            ``s_trap 0x04``                 Reserved.
   reserved            ``s_trap 0x05``                 Reserved.
   reserved            ``s_trap 0x06``                 Reserved.
   reserved            ``s_trap 0x07``                 Reserved.
   reserved            ``s_trap 0x08``                 Reserved.
   reserved            ``s_trap 0xfe``                 Reserved.
   reserved            ``s_trap 0xff``                 Reserved.
   =================== =============== =============== =======================================

.. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
   :name: amdgpu-trap-handler-for-amdhsa-os-v3-table

   =================== =============== =============== =======================================
   Usage               Code Sequence   Trap Handler    Description
                                       Inputs
   =================== =============== =============== =======================================
   reserved            ``s_trap 0x00``                 Reserved by hardware.
   debugger breakpoint ``s_trap 0x01`` *none*          Reserved for debugger to use for
                                                       breakpoints. Causes wave to be halted
                                                       with the PC at the trap instruction.
                                                       The debugger is responsible for
                                                       resuming the wave, including the
                                                       instruction that the breakpoint
                                                       overwrote.
   ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes wave to be halted with the PC at
                                         ``queue_ptr`` the trap instruction. The associated
                                                       queue is signalled to put it into the
                                                       error state. When the queue is put in
                                                       the error state, the waves executing
                                                       dispatches on the queue will be
                                                       terminated.
   ``llvm.debugtrap``  ``s_trap 0x03`` *none*          - If the debugger is not enabled,
                                                         behaves as a no-operation. The trap
                                                         handler is entered and immediately
                                                         returns to continue execution of the
                                                         wavefront.
                                                       - If the debugger is enabled, causes
                                                         the debug trap to be reported by the
                                                         debugger and the wavefront is put in
                                                         the halt state with the PC at the
                                                         instruction. The debugger must
                                                         increment the PC and resume the wave.
   reserved            ``s_trap 0x04``                 Reserved.
   reserved            ``s_trap 0x05``                 Reserved.
   reserved            ``s_trap 0x06``                 Reserved.
   reserved            ``s_trap 0x07``                 Reserved.
   reserved            ``s_trap 0x08``                 Reserved.
   reserved            ``s_trap 0xfe``                 Reserved.
   reserved            ``s_trap 0xff``                 Reserved.
   =================== =============== =============== =======================================

.. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4 and Above
   :name: amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table

   =================== =============== ================ ================= =======================================
   Usage               Code Sequence   GFX6-GFX8 Inputs GFX9-GFX11 Inputs Description
   =================== =============== ================ ================= =======================================
   reserved            ``s_trap 0x00``                                    Reserved by hardware.
   debugger breakpoint ``s_trap 0x01`` *none*           *none*            Reserved for debugger to use for
                                                                          breakpoints. Causes wave to be halted
                                                                          with the PC at the trap instruction.
                                                                          The debugger is responsible for
                                                                          resuming the wave, including the
                                                                          instruction that the breakpoint
                                                                          overwrote.
   ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:     *none*            Causes wave to be halted with the PC at
                                         ``queue_ptr``                    the trap instruction. The associated
                                                                          queue is signalled to put it into the
                                                                          error state. When the queue is put in
                                                                          the error state, the waves executing
                                                                          dispatches on the queue will be
                                                                          terminated.
   ``llvm.debugtrap``  ``s_trap 0x03`` *none*           *none*            - If the debugger is not enabled,
                                                                            behaves as a no-operation. The trap
                                                                            handler is entered and immediately
                                                                            returns to continue execution of the
                                                                            wavefront.
                                                                          - If the debugger is enabled, causes
                                                                            the debug trap to be reported by the
                                                                            debugger and the wavefront is put in
                                                                            the halt state with the PC at the
                                                                            instruction. The debugger must
                                                                            increment the PC and resume the wave.
   reserved            ``s_trap 0x04``                                    Reserved.
   reserved            ``s_trap 0x05``                                    Reserved.
   reserved            ``s_trap 0x06``                                    Reserved.
   reserved            ``s_trap 0x07``                                    Reserved.
   reserved            ``s_trap 0x08``                                    Reserved.
   reserved            ``s_trap 0xfe``                                    Reserved.
   reserved            ``s_trap 0xff``                                    Reserved.
   =================== =============== ================ ================= =======================================

13450 .. _amdgpu-amdhsa-function-call-convention:
This section is currently incomplete and has inaccuracies. It is a work in
progress that will be updated as information is determined.
13460 See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
13461 addresses. Unswizzled addresses are normal linear addresses.
13463 .. _amdgpu-amdhsa-function-call-convention-kernel-functions:
13468 This section describes the call convention ABI for the outer kernel function.
13470 See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU backend implements function calls:
13476 1. Clang decides the kernarg layout to match the *HSA Programmer's Language
13479 - All structs are passed directly.
13480 - Lambda values are passed *TBA*.
13484 - Does this really follow HSA rules? Or are structs >16 bytes passed
13486 - What is ABI for lambda values?
13488 4. The kernel performs certain setup in its prolog, as described in
13489 :ref:`amdgpu-amdhsa-kernel-prolog`.
13491 .. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
13493 Non-Kernel Functions
13494 ++++++++++++++++++++
13496 This section describes the call convention ABI for functions other than the
13497 outer kernel function.
13499 If a kernel has function calls then scratch is always allocated and used for
13500 the call stack which grows from low address to high address using the swizzled
13501 scratch address space.
13503 On entry to a function:
13505 1. SGPR0-3 contain a V# with the following properties (see
13506 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
13508 * Base address pointing to the beginning of the wavefront scratch backing
13510 * Swizzled with dword element size and stride of wavefront size elements.
2. The FLAT_SCRATCH register pair is set up. See
13513 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
13514 3. GFX6-GFX8: M0 register set to the size of LDS in bytes. See
13515 :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
13516 4. The EXEC register is set to the lanes active on entry to the function.
13517 5. MODE register: *TBD*
13518 6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
13520 7. SGPR30-31 return address (RA). The code address that the function must
13521 return to when it completes. The value is undefined if the function is *no
13523 8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
13524 offset relative to the beginning of the wavefront scratch backing memory.
13526 The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
13527 offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
13530 The unswizzled SP value can be converted into the swizzled SP value by:
13532 | swizzled SP = unswizzled SP / wavefront size
13534 This may be used to obtain the private address space address of stack
13535 objects and to convert this address to a flat address by adding the flat
13536 scratch aperture base address.
    The swizzled SP value is always 4 byte aligned for the ``r600``
    architecture and 16 byte aligned for the ``amdgcn`` architecture.
13543 The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
13544 OpenCL language which has the largest base type defined as 16 bytes.
13546 On entry, the swizzled SP value is the address of the first function
13547 argument passed on the stack. Other stack passed arguments are positive
13548 offsets from the entry swizzled SP value.
13550 The function may use positive offsets beyond the last stack passed argument
13551 for stack allocated local variables and register spill slots. If necessary,
13552 the function may align these to greater alignment than 16 bytes. After these
13553 the function may dynamically allocate space for such things as runtime sized
13554 ``alloca`` local allocations.
13556 If the function calls another function, it will place any stack allocated
13557 arguments after the last local allocation and adjust SGPR32 to the address
13558 after the last local allocation.
13560 9. All other registers are unspecified.
13561 10. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
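
The stack pointer arithmetic described for SGPR32 above can be sketched as
follows. This is only an illustration of the stated formula (swizzled SP =
unswizzled SP / wavefront size) and of the 16 byte ``amdgcn`` alignment rule.
The function names, the example wavefront size of 64, and the aperture base
value are assumptions made for the example and are not part of the ABI.

```python
def swizzled_sp(unswizzled_sp: int, wavefront_size: int) -> int:
    """Divide the unswizzled scratch offset by the wavefront size."""
    if wavefront_size <= 0:
        raise ValueError("wavefront size must be positive")
    return unswizzled_sp // wavefront_size


def is_amdgcn_aligned(swizzled: int) -> bool:
    """The swizzled SP is 16 byte aligned for the amdgcn architecture."""
    return swizzled % 16 == 0


def flat_address(swizzled: int, flat_scratch_aperture_base: int) -> int:
    """Form a flat address by adding the flat scratch aperture base."""
    return flat_scratch_aperture_base + swizzled


# Hypothetical values: a wavefront size of 64 and an aperture base of 0x2000.
sp = swizzled_sp(unswizzled_sp=4096, wavefront_size=64)
print(sp, is_amdgcn_aligned(sp), hex(flat_address(sp, 0x2000)))
```

The same division is what relates the CFA to ``DW_AT_frame_base`` later in
this section.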
13564 On exit from a function:
13566 1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
13567 described below. Any registers used are considered clobbered registers.
13568 2. The following registers are preserved and have the same value as on entry:
13573 * All SGPR registers except the clobbered registers of SGPR4-31.
13591 Except the argument registers, the VGPRs clobbered and the preserved
13592 registers are intermixed at regular intervals in order to keep a
13593 similar ratio independent of the number of allocated VGPRs.
13595 * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
13596 * Lanes of all VGPRs that are inactive at the call site.
13598 For the AMDGPU backend, an inter-procedural register allocation (IPRA)
    optimization may mark some of the clobbered SGPR and VGPR registers as
13600 preserved if it can be determined that the called function does not change
3. The PC is set to the RA provided on entry.
4. MODE register: *TBD*.
5. All other registers are clobbered.
6. Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
   the function is available to the caller.
13611 - How are function results returned? The address of structured types is passed
13612 by reference, but what about other types?
13614 The function input arguments are made up of the formal arguments explicitly
13615 declared by the source language function plus the implicit input arguments used
13616 by the implementation.
13618 The source language input arguments are:
13620 1. Any source language implicit ``this`` or ``self`` argument comes first as a
13622 2. Followed by the function formal arguments in left to right source order.
13624 The source language result arguments are:
13626 1. The function result argument.
The source language input or result struct type arguments that are less than or
equal to 16 bytes are decomposed recursively into their base type fields, and
each field is passed as if a separate argument. For input arguments, if the
13631 called function requires the struct to be in memory, for example because its
13632 address is taken, then the function body is responsible for allocating a stack
13633 location and copying the field arguments into it. Clang terms this *direct
The source language input struct type arguments that are greater than 16 bytes
are passed by reference. The caller is responsible for allocating a stack
location to make a copy of the struct value and passing the address as the input
argument. The called function is responsible for performing the dereference when
accessing the input argument. Clang terms this *by-value struct*.
A source language result struct type argument that is greater than 16 bytes is
returned by reference. The caller is responsible for allocating a stack location
to hold the result value and passes the address as the last input argument
(before the implicit input arguments). In this case there are no result
arguments. The called function is responsible for performing the dereference
when storing the result value. Clang terms this *structured return (sret)*.
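
The 16 byte threshold described in the preceding paragraphs can be summarised
with a small sketch. The function name and the returned labels are
hypothetical and chosen only for illustration. As the notes that follow make
clear, the ``sret`` description is still under review, so this reflects the
text as written rather than confirmed backend behaviour.

```python
DECOMPOSE_LIMIT_BYTES = 16  # threshold stated in the text above


def classify_struct(size_in_bytes: int, is_result: bool) -> str:
    """Return how a struct of the given size is passed, per the text above."""
    if size_in_bytes <= DECOMPOSE_LIMIT_BYTES:
        # Fields are passed recursively as if they were separate arguments.
        return "decomposed"
    if is_result:
        # Caller allocates stack space and passes its address as the last
        # source language input argument.
        return "sret"
    # Caller copies the struct to the stack and passes the address.
    return "by_reference"


for size in (8, 16, 24):
    print(size, classify_struct(size, False), classify_struct(size, True))
```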
13649 *TODO: correct the ``sret`` definition.*
13653 Is this definition correct? Or is ``sret`` only used if passing in registers, and
13654 pass as non-decomposed struct as stack argument? Or something else? Is the
13655 memory location in the caller stack frame, or a stack memory argument and so
13656 no address is passed as the caller can directly write to the argument stack
13657 location? But then the stack location is still live after return. If an
13658 argument stack location is it the first stack argument or the last one?
13660 Lambda argument types are treated as struct types with an implementation defined
13665 Need to specify the ABI for lambda types for AMDGPU.
For the AMDGPU backend, all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg``, in which
case they are passed in SGPRs.
13671 The AMDGPU backend walks the function call graph from the leaves to determine
13672 which implicit input arguments are used, propagating to each caller of the
13673 function. The used implicit arguments are appended to the function arguments
13674 after the source language arguments in the following order:
13678 Is recursion or external functions supported?
13680 1. Work-Item ID (1 VGPR)
    The X, Y and Z work-item IDs are packed into a single VGPR with the following
13683 layout. Only fields actually used by the function are set. The other bits
13686 The values come from the initial kernel execution state. See
13687 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
13689 .. table:: Work-item implicit argument layout
13690 :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table
13692 ======= ======= ==============
13693 Bits Size Field Name
13694 ======= ======= ==============
13695 9:0 10 bits X Work-Item ID
13696 19:10 10 bits Y Work-Item ID
13697 29:20 10 bits Z Work-Item ID
13698 31:30 2 bits Unused
13699 ======= ======= ==============
13701 2. Dispatch Ptr (2 SGPRs)
13703 The value comes from the initial kernel execution state. See
13704 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13706 3. Queue Ptr (2 SGPRs)
13708 The value comes from the initial kernel execution state. See
13709 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13711 4. Kernarg Segment Ptr (2 SGPRs)
13713 The value comes from the initial kernel execution state. See
13714 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13716 5. Dispatch id (2 SGPRs)
13718 The value comes from the initial kernel execution state. See
13719 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13721 6. Work-Group ID X (1 SGPR)
13723 The value comes from the initial kernel execution state. See
13724 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13726 7. Work-Group ID Y (1 SGPR)
13728 The value comes from the initial kernel execution state. See
13729 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13731 8. Work-Group ID Z (1 SGPR)
13733 The value comes from the initial kernel execution state. See
13734 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
13736 9. Implicit Argument Ptr (2 SGPRs)
13738 The value is computed by adding an offset to Kernarg Segment Ptr to get the
13739 global address space pointer to the first kernarg implicit argument.
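
The packed Work-Item ID layout given in the table above (10 bits each for X, Y
and Z, with bits 31:30 unused) can be decoded with plain shifts and masks. The
sketch below is illustrative only, and the helper names are not taken from the
backend.

```python
FIELD_BITS = 10
FIELD_MASK = (1 << FIELD_BITS) - 1  # 0x3FF


def pack_work_item_id(x: int, y: int, z: int) -> int:
    """Pack X into bits 9:0, Y into bits 19:10 and Z into bits 29:20."""
    for value in (x, y, z):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError("each work-item ID must fit in 10 bits")
    return x | (y << 10) | (z << 20)


def unpack_work_item_id(vgpr: int) -> tuple:
    """Extract (X, Y, Z) from the packed 32-bit value."""
    x = vgpr & FIELD_MASK
    y = (vgpr >> 10) & FIELD_MASK
    z = (vgpr >> 20) & FIELD_MASK
    return (x, y, z)


packed = pack_work_item_id(5, 7, 9)
print(hex(packed), unpack_work_item_id(packed))
```

Because only the fields actually used by the function are set, a consumer
should read just the fields it is known to need.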
13741 The input and result arguments are assigned in order in the following manner:
13745 There are likely some errors and omissions in the following description that
13750 Check the Clang source code to decipher how function arguments and return
13751 results are handled. Also see the AMDGPU specific values used.
13753 * VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
13756 If there are more arguments than will fit in these registers, the remaining
13757 arguments are allocated on the stack in order on naturally aligned
13762 How are overly aligned structures allocated on the stack?
13764 * SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
13767 If there are more arguments than will fit in these registers, the remaining
13768 arguments are allocated on the stack in order on naturally aligned
13771 Note that decomposed struct type arguments may have some fields passed in
13772 registers and some in memory.
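
As a rough illustration of the VGPR assignment order described above, the
sketch below places arguments in consecutive VGPRs and overflows the remainder
onto the stack. It makes several simplifying assumptions that are not stated
by this document: every argument is a single 32-bit value, the register range
is VGPR0-31 as listed in the function entry state, and stack slots are 4 bytes
at increasing positive offsets from the entry SP. The function name is
hypothetical.

```python
NUM_ARGUMENT_VGPRS = 32   # VGPR0-31, per the function entry description
STACK_SLOT_BYTES = 4      # assumption: each overflow argument is one dword


def assign_vgpr_arguments(num_arguments: int) -> list:
    """Return a location string for each 32-bit VGPR argument, in order."""
    locations = []
    stack_offset = 0
    for index in range(num_arguments):
        if index < NUM_ARGUMENT_VGPRS:
            locations.append(f"VGPR{index}")
        else:
            # Remaining arguments go on the stack in order.
            locations.append(f"stack+{stack_offset}")
            stack_offset += STACK_SLOT_BYTES
    return locations


print(assign_vgpr_arguments(34)[30:])
```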
13776 So, a struct which can pass some fields as decomposed register arguments, will
13777 pass the rest as decomposed stack elements? But an argument that will not start
13778 in registers will not be decomposed and will be passed as a non-decomposed
The following is not part of the AMDGPU function calling convention but
describes how the AMDGPU backend implements function calls:
13784 1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
13785 unswizzled scratch address. It is only needed if runtime sized ``alloca``
13786 are used, or for the reasons defined in ``SIFrameLowering``.
13787 2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
13788 to access the incoming stack arguments in the function. The BP is needed
13789 only when the function requires the runtime stack alignment.
3. Allocating SGPR arguments on the stack is not supported.
13793 4. No CFI is currently generated. See
13794 :ref:`amdgpu-dwarf-call-frame-information`.
13798 CFI will be generated that defines the CFA as the unswizzled address
13799 relative to the wave scratch base in the unswizzled private address space
13800 of the lowest address stack allocated local variable.
13802 ``DW_AT_frame_base`` will be defined as the swizzled address in the
13803 swizzled private address space by dividing the CFA by the wavefront size
13804 (since CFA is always at least dword aligned which matches the scratch
13805 swizzle element size).
13807 If no dynamic stack alignment was performed, the stack allocated arguments
13808 are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
13809 local variables and register spill slots are accessed as positive offsets
13810 relative to ``DW_AT_frame_base``.
13812 5. Function argument passing is implemented by copying the input physical
13813 registers to virtual registers on entry. The register allocator can spill if
13814 necessary. These are copied back to physical registers at call sites. The
13815 net effect is that each function call can have these values in entirely
13816 distinct locations. The IPRA can help avoid shuffling argument registers.
13817 6. Call sites are implemented by setting up the arguments at positive offsets
13818 from SP. Then SP is incremented to account for the known frame size before
13819 the call and decremented after the call.
13823 The CFI will reflect the changed calculation needed to compute the CFA
13826 7. 4 byte spill slots are used in the stack frame. One slot is allocated for an
13827 emergency spill slot. Buffer instructions are used for stack accesses and
13828 not the ``flat_scratch`` instruction.
13832 Explain when the emergency spill slot is used.
13836 Possible broken issues:
13838 - Stack arguments must be aligned to required alignment.
13839 - Stack is aligned to max(16, max formal argument alignment)
13840 - Direct argument < 64 bits should check register budget.
13841 - Register budget calculation should respect ``inreg`` for SGPR.
13842 - SGPR overflow is not handled.
13843 - struct with 1 member unpeeling is not checking size of member.
13844 - ``sret`` is after ``this`` pointer.
13845 - Caller is not implementing stack realignment: need an extra pointer.
13846 - Should say AMDGPU passes FP rather than SP.
13847 - Should CFI define CFA as address of locals or arguments. Difference is
13848 apparent when have implemented dynamic alignment.
13849 - If ``SCRATCH`` instruction could allow negative offsets, then can make FP be
13850 highest address of stack frame and use negative offset for locals. Would
13851 allow SP to be the same as FP and could support signal-handler-like as now
13852 have a real SP for the top of the stack.
13853 - How is ``sret`` passed on the stack? In argument stack area? Can it overlay
13859 This section provides code conventions used when the target triple OS is
13860 ``amdpal`` (see :ref:`amdgpu-target-triples`).
13862 .. _amdgpu-amdpal-code-object-metadata-section:
13864 Code Object Metadata
13865 ~~~~~~~~~~~~~~~~~~~~
13869 The metadata is currently in development and is subject to major
13870 changes. Only the current version is supported. *When this document
13871 was generated the version was 2.6.*
13873 Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
13874 record (see :ref:`amdgpu-note-records-v3-onwards`).
13876 The metadata is represented as Message Pack formatted binary data (see
13877 [MsgPack]_). The top level is a Message Pack map that includes the keys
13878 defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
13879 and referenced tables.
13881 Additional information can be added to the maps. To avoid conflicts, any
13882 key names should be prefixed by "*vendor-name*." where ``vendor-name``
13883 can be the name of the vendor and specific vendor tool that generates the
13884 information. The prefix is abbreviated to simply "." when it appears
13885 within a map that has been added by the same *vendor-name*.
13887 .. table:: AMDPAL Code Object Metadata Map
13888 :name: amdgpu-amdpal-code-object-metadata-map-table
13890 =================== ============== ========= ======================================================================
13891 String Key Value Type Required? Description
13892 =================== ============== ========= ======================================================================
13893 "amdpal.version" sequence of Required PAL code object metadata (major, minor) version. The current values
13894 2 integers are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
13895 "amdpal.pipelines" sequence of Required Per-pipeline metadata. See
13896 map :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
13897 definition of the keys included in that map.
13898 =================== ============== ========= ======================================================================
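
Once the Message Pack data has been decoded into native containers, the
top-level map described above can be checked with a few lines of code. The
sketch below assumes the decoding has already happened (no particular Message
Pack library is implied) and uses a hypothetical ``check_top_level`` helper.
The version value ``[2, 6]`` simply mirrors the version mentioned earlier in
this section.

```python
REQUIRED_TOP_LEVEL_KEYS = ("amdpal.version", "amdpal.pipelines")


def check_top_level(metadata: dict) -> list:
    """Return a list of problems found in the decoded top-level map."""
    problems = [f"missing required key {key}"
                for key in REQUIRED_TOP_LEVEL_KEYS if key not in metadata]

    version = metadata.get("amdpal.version")
    if version is not None:
        ok = (isinstance(version, (list, tuple)) and len(version) == 2
              and all(isinstance(v, int) for v in version))
        if not ok:
            problems.append("amdpal.version must be a sequence of 2 integers")

    pipelines = metadata.get("amdpal.pipelines")
    if pipelines is not None:
        ok = (isinstance(pipelines, (list, tuple))
              and all(isinstance(p, dict) for p in pipelines))
        if not ok:
            problems.append("amdpal.pipelines must be a sequence of maps")
    return problems


example = {
    "amdpal.version": [2, 6],
    "amdpal.pipelines": [{".registers": {}, ".internal_pipeline_hash": [0, 0]}],
}
print(check_top_level(example))
```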
13902 .. table:: AMDPAL Code Object Pipeline Metadata Map
13903 :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table
13905 ====================================== ============== ========= ===================================================
13906 String Key Value Type Required? Description
13907 ====================================== ============== ========= ===================================================
13908 ".name" string Source name of the pipeline.
13909 ".type" string Pipeline type, e.g. VsPs. Values include:
13919 ".internal_pipeline_hash" sequence of Required Internal compiler hash for this pipeline. Lower
13920 2 integers 64 bits is the "stable" portion of the hash, used
13921 for e.g. shader replacement lookup. Upper 64 bits
13922 is the "unique" portion of the hash, used for
13923 e.g. pipeline cache lookup. The value is
13924 implementation defined, and can not be relied on
13925 between different builds of the compiler.
13926 ".shaders" map Per-API shader metadata. See
13927 :ref:`amdgpu-amdpal-code-object-shader-map-table`
13928 for the definition of the keys included in that
13930 ".hardware_stages" map Per-hardware stage metadata. See
13931 :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
13932 for the definition of the keys included in that
13934 ".shader_functions" map Per-shader function metadata. See
13935 :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
13936 for the definition of the keys included in that
13938 ".registers" map Required Hardware register configuration. See
13939 :ref:`amdgpu-amdpal-code-object-register-map-table`
13940 for the definition of the keys included in that
13942 ".user_data_limit" integer Number of user data entries accessed by this
13944 ".spill_threshold" integer The user data spill threshold. 0xFFFF for
13945 NoUserDataSpilling.
13946 ".uses_viewport_array_index" boolean Indicates whether or not the pipeline uses the
13947 viewport array index feature. Pipelines which use
13948 this feature can render into all 16 viewports,
13949 whereas pipelines which do not use it are
13950 restricted to viewport #0.
13951 ".es_gs_lds_size" integer Size in bytes of LDS space used internally for
13952 handling data-passing between the ES and GS
13953 shader stages. This can be zero if the data is
13954 passed using off-chip buffers. This value should
13955 be used to program all user-SGPRs which have been
13956 marked with "UserDataMapping::EsGsLdsSize"
13957 (typically only the GS and VS HW stages will ever
13958 have a user-SGPR so marked).
13959 ".nggSubgroupSize" integer Explicit maximum subgroup size for NGG shaders
13960 (maximum number of threads in a subgroup).
13961 ".num_interpolants" integer Graphics only. Number of PS interpolants.
13962 ".mesh_scratch_memory_size" integer Max mesh shader scratch memory used.
13963 ".api" string Name of the client graphics API.
13964 ".api_create_info" binary Graphics API shader create info binary blob. Can
13965 be defined by the driver using the compiler if
13966 they want to be able to correlate API-specific
13967 information used during creation at a later time.
13968 ====================================== ============== ========= ===================================================
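
The ``.internal_pipeline_hash`` entry above describes a 128-bit value whose
lower 64 bits are the "stable" portion and whose upper 64 bits are the
"unique" portion. A minimal sketch of splitting and rejoining such a value is
shown below. The helper names are hypothetical, and the table does not state
which element of the 2-integer sequence holds which half, so no ordering is
implied here.

```python
MASK_64 = (1 << 64) - 1


def split_pipeline_hash(hash_128: int) -> tuple:
    """Return (stable, unique): the lower and upper 64 bits respectively."""
    stable = hash_128 & MASK_64
    unique = (hash_128 >> 64) & MASK_64
    return (stable, unique)


def join_pipeline_hash(stable: int, unique: int) -> int:
    """Recombine the two 64-bit halves into one 128-bit value."""
    return ((unique & MASK_64) << 64) | (stable & MASK_64)


stable, unique = split_pipeline_hash((0xAAAA << 64) | 0x5555)
print(hex(stable), hex(unique))
```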
13972 .. table:: AMDPAL Code Object Shader Map
13973 :name: amdgpu-amdpal-code-object-shader-map-table
13976 +-------------+--------------+-------------------------------------------------------------------+
13977 |String Key |Value Type |Description |
13978 +=============+==============+===================================================================+
13979 |- ".compute" |map |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
13980 |- ".vertex" | |for the definition of the keys included in that map. |
13983 |- ".geometry"| | |
13985 +-------------+--------------+-------------------------------------------------------------------+
13989 .. table:: AMDPAL Code Object API Shader Metadata Map
13990 :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table
13992 ==================== ============== ========= =====================================================================
13993 String Key Value Type Required? Description
13994 ==================== ============== ========= =====================================================================
13995 ".api_shader_hash" sequence of Required Input shader hash, typically passed in from the client. The value
13996 2 integers is implementation defined, and can not be relied on between
13997 different builds of the compiler.
13998 ".hardware_mapping" sequence of Required Flags indicating the HW stages this API shader maps to. Values
14009 ==================== ============== ========= =====================================================================
14013 .. table:: AMDPAL Code Object Hardware Stage Map
14014 :name: amdgpu-amdpal-code-object-hardware-stage-map-table
14016 +-------------+--------------+-----------------------------------------------------------------------+
14017 |String Key |Value Type |Description |
14018 +=============+==============+=======================================================================+
14019 |- ".ls" |map |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
14020 |- ".hs" | |for the definition of the keys included in that map. |
14026 +-------------+--------------+-----------------------------------------------------------------------+
14030 .. table:: AMDPAL Code Object Hardware Stage Metadata Map
14031 :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table
14033 ========================== ============== ========= ===============================================================
14034 String Key Value Type Required? Description
14035 ========================== ============== ========= ===============================================================
14036 ".entry_point" string The ELF symbol pointing to this pipeline's stage entry point.
14037 ".scratch_memory_size" integer Scratch memory size in bytes.
14038 ".lds_size" integer Local Data Share size in bytes.
14039 ".perf_data_buffer_size" integer Performance data buffer size in bytes.
14040 ".vgpr_count" integer Number of VGPRs used.
14041 ".agpr_count" integer Number of AGPRs used.
14042 ".sgpr_count" integer Number of SGPRs used.
14043 ".vgpr_limit" integer If non-zero, indicates the shader was compiled with a
14044 directive to instruct the compiler to limit the VGPR usage to
14045 be less than or equal to the specified value (only set if
14046 different from HW default).
14047 ".sgpr_limit" integer SGPR count upper limit (only set if different from HW
14049 ".threadgroup_dimensions" sequence of Thread-group X/Y/Z dimensions (Compute only).
14051 ".wavefront_size" integer Wavefront size (only set if different from HW default).
14052 ".uses_uavs" boolean The shader reads or writes UAVs.
14053 ".uses_rovs" boolean The shader reads or writes ROVs.
14054 ".writes_uavs" boolean The shader writes to one or more UAVs.
14055 ".writes_depth" boolean The shader writes out a depth value.
14056 ".uses_append_consume" boolean The shader uses append and/or consume operations, either
14058 ".uses_prim_id" boolean The shader uses PrimID.
14059 ========================== ============== ========= ===============================================================
14063 .. table:: AMDPAL Code Object Shader Function Map
14064 :name: amdgpu-amdpal-code-object-shader-function-map-table
14066 =============== ============== ====================================================================
14067 String Key Value Type Description
14068 =============== ============== ====================================================================
14069 *symbol name* map *symbol name* is the ELF symbol name of the shader function code
14070 entry address. The value is the function's metadata. See
14071 :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
14072 =============== ============== ====================================================================
14076 .. table:: AMDPAL Code Object Shader Function Metadata Map
14077 :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table
14079 ============================= ============== =================================================================
14080 String Key Value Type Description
14081 ============================= ============== =================================================================
14082 ".api_shader_hash" sequence of Input shader hash, typically passed in from the client. The value
14083 2 integers is implementation defined, and can not be relied on between
14084 different builds of the compiler.
14085 ".scratch_memory_size" integer Size in bytes of scratch memory used by the shader.
14086 ".lds_size" integer Size in bytes of LDS memory.
14087 ".vgpr_count" integer Number of VGPRs used by the shader.
14088 ".sgpr_count" integer Number of SGPRs used by the shader.
14089 ".stack_frame_size_in_bytes" integer Amount of stack size used by the shader.
14090 ".shader_subtype" string Shader subtype/kind. Values include:
14094 ============================= ============== =================================================================
14098 .. table:: AMDPAL Code Object Register Map
14099 :name: amdgpu-amdpal-code-object-register-map-table
14101 ========================== ============== ====================================================================
14102 32-bit Integer Key Value Type Description
14103 ========================== ============== ====================================================================
14104 ``reg offset`` 32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
14105 a GRBM register (i.e., driver accessible GPU register number, not
14106 shader GPR register number). The driver is required to program each
14107 specified register to the corresponding specified value when
14108 executing this pipeline. Typically, the ``reg offsets`` are the
14109 ``uint16_t`` offsets to each register as defined by the hardware
14110 chip headers. The register is set to the provided value. However, a
14111 ``reg offset`` that specifies a user data register (e.g.,
14112 COMPUTE_USER_DATA_0) needs special treatment. See
14113 :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
14115 ========================== ============== ====================================================================
14117 .. _amdgpu-amdpal-code-object-user-data-section:
14122 Each hardware stage has a set of 32-bit physical SPI *user data registers*
14123 (either 16 or 32 based on graphics IP and the stage) which can be
14124 written from a command buffer and then loaded into SGPRs when waves are
14125 launched via a subsequent dispatch or draw operation. This is the way
14126 most arguments are passed from the application/runtime to a hardware
14129 PAL abstracts this functionality by exposing a set of 128 *user data
14130 entries* per pipeline a client can use to pass arguments from a command
14131 buffer to one or more shaders in that pipeline. The ELF code object must
14132 specify a mapping from virtualized *user data entries* to physical *user
14133 data registers*, and PAL is responsible for implementing that mapping,
14134 including spilling overflow *user data entries* to memory if needed.
14136 Since the *user data registers* are GRBM-accessible SPI registers, this
14137 mapping is actually embedded in the ``.registers`` metadata entry. For
14138 most registers, the value in that map is a literal 32-bit value that
14139 should be written to the register by the driver. However, when the
14140 register is a *user data register* (any USER_DATA register e.g.,
14141 SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
14142 the driver to write either a *user data entry* value or one of several
14143 driver-internal values to the register. This encoding is described in
14144 the following table:
14148 Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0,
14149 and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
14150 always be programmed to the address of the GlobalTable, and *user data
14151 register* 1 must always be programmed to the address of the PerShaderTable.
14155 .. table:: AMDPAL User Data Mapping
14156 :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table
14158 ========== ================= ===============================================================================
14159 Value Name Description
14160 ========== ================= ===============================================================================
14161 0..127 *User Data Entry* 32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
14162 0x10000000 GlobalTable 32-bit pointer to GPU memory containing the global internal table (should
14163 always point to *user data register* 0).
14164 0x10000001 PerShaderTable 32-bit pointer to GPU memory containing the per-shader internal table. See
14165 :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
14166 for more detail (should always point to *user data register* 1).
14167 0x10000002 SpillTable 32-bit pointer to GPU memory containing the user data spill table. See
14168 :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
14170 0x10000003 BaseVertex Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
14171 reference the draw index in the vertex shader. Only supported by the first
14172 stage in a graphics pipeline.
14173 0x10000004 BaseInstance Instance offset (32-bit unsigned integer). Only supported by the first stage in
14174 a graphics pipeline.
14175 0x10000005 DrawIndex Draw index (32-bit unsigned integer). Only supported by the first stage in a
14177 0x10000006 Workgroup Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
14178 a buffer containing the grid dimensions for a Compute dispatch operation. The
14179 high half of the address is stored in the next sequential user-SGPR. Only
14180 supported by compute pipelines.
14181 0x1000000A EsGsLdsSize Indicates that PAL will program this user-SGPR to contain the amount of LDS
14182 space used for the ES/GS pseudo-ring-buffer for passing data between shader
14184 0x1000000B ViewId View id (32-bit unsigned integer) identifies a view of graphics
14185 pipeline instancing.
14186 0x1000000C StreamOutTable 32-bit pointer to GPU memory containing the stream out target SRD table. This
14187 can only appear for one shader stage per pipeline.
14188 0x1000000D PerShaderPerfData 32-bit pointer to GPU memory containing the per-shader performance data buffer.
14189 0x1000000F VertexBufferTable 32-bit pointer to GPU memory containing the vertex buffer SRD table. This can
14190 only appear for one shader stage per pipeline.
14191 0x10000010 UavExportTable 32-bit pointer to GPU memory containing the UAV export SRD table. This can
14192 only appear for one shader stage per pipeline (PS). These replace color targets
14193 and are completely separate from any UAVs used by the shader. This is optional,
14194 and only used by the PS when UAV exports are used to replace color-target
14195 exports to optimize specific shaders.
14196 0x10000011 NggCullingData 64-bit pointer to GPU memory containing the hardware register data needed by
14197 some NGG pipelines to perform culling. This value contains the address of the
14198 first of two consecutive registers which provide the full GPU address.
14199 0x10000015 FetchShaderPtr 64-bit pointer to GPU memory containing the fetch shader subroutine.
14200 ========== ================= ===============================================================================
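As an illustration of how a tool might interpret the values in the table above, the following is a minimal Python sketch. The dictionary and the helper name ``describe_user_data_mapping`` are hypothetical and are not part of PAL or LLVM; only the numeric values and names are taken from the table.

```python
# Hypothetical lookup built from the AMDPAL User Data Mapping table.
SPECIAL_USER_DATA = {
    0x10000000: "GlobalTable",
    0x10000001: "PerShaderTable",
    0x10000002: "SpillTable",
    0x10000003: "BaseVertex",
    0x10000004: "BaseInstance",
    0x10000005: "DrawIndex",
    0x10000006: "Workgroup",
    0x1000000A: "EsGsLdsSize",
    0x1000000B: "ViewId",
    0x1000000C: "StreamOutTable",
    0x1000000D: "PerShaderPerfData",
    0x1000000F: "VertexBufferTable",
    0x10000010: "UavExportTable",
    0x10000011: "NggCullingData",
    0x10000015: "FetchShaderPtr",
}


def describe_user_data_mapping(value):
    """Return a readable name for one user data register mapping value."""
    # Values 0..127 select a user data entry directly.
    if 0 <= value <= 127:
        return f"user_data_entry[{value}]"
    # Anything else must be one of the driver-internal values in the table.
    name = SPECIAL_USER_DATA.get(value)
    if name is None:
        raise ValueError(f"unknown user data mapping value {value:#x}")
    return name
```

For example, ``describe_user_data_mapping(0x10000002)`` yields ``SpillTable``, while a value outside both ranges is rejected.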
14202 .. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:
14207 Low 32 bits of the GPU address for an optional buffer in the ``.data``
14208 section of the ELF. The high 32 bits of the address match the high 32 bits
14209 of the shader's program counter.
14211 The buffer can be anything the shader compiler needs it for, and
14212 allows each shader to have its own region of the ``.data`` section.
14213 Typically, this could be a table of buffer SRDs and the data pointed to
14214 by the buffer SRDs, but it could be a flat-address region of memory as
14215 well. Its layout and usage are defined by the shader compiler.
14217 Each shader's table in the ``.data`` section is referenced by the symbol
14218 ``_amdgpu_``\ *xs*\ ``_shdr_intrl_data`` where *xs* corresponds with the
14219 hardware shader stage the data is for. E.g.,
14220 ``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.
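The address reconstruction and symbol naming described above can be sketched in Python as follows. Both function names are hypothetical, and the sketch assumes the program counter and the register value are available as plain integers.

```python
def per_shader_table_address(program_counter, table_low32):
    """Rebuild the 64-bit table address from the PC and the low 32 bits."""
    # The high 32 bits are taken from the shader's program counter.
    high = program_counter & 0xFFFFFFFF00000000
    # The low 32 bits come from the user data register.
    return high | (table_low32 & 0xFFFFFFFF)


def per_shader_table_symbol(stage):
    """Build the symbol name for a hardware stage abbreviation such as 'cs'."""
    return f"_amdgpu_{stage}_shdr_intrl_data"
```

With a program counter of ``0x0000123400001000`` and a register value of ``0x00ABC000``, the sketch produces ``0x0000123400ABC000``.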
14222 .. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:
14227 It is possible for a hardware shader to need access to more *user data
14228 entries* than there are slots available in user data registers for one
14229 or more hardware shader stages. In that case, the PAL runtime expects
14230 the necessary *user data entries* to be spilled to GPU memory and use
14231 one user data register to point to the spilled user data memory. The
14232 value of the *user data entry* must then represent the location where
14233 a shader expects to read the low 32 bits of the table's GPU virtual
14234 address. The *spill table* itself represents a set of 32-bit values
14235 managed by the PAL runtime in GPU-accessible memory that can be made
14236 indirectly accessible to a hardware shader.
14241 This section provides code conventions used when the target triple OS is
14242 empty (see :ref:`amdgpu-target-triples`).
14247 For code objects generated by the AMDGPU backend for a non-amdhsa OS, the runtime does
14248 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
14249 instructions are handled as follows:
14251 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
14252 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
14254 =============== =============== ===========================================
14255 Usage Code Sequence Description
14256 =============== =============== ===========================================
14257 llvm.trap s_endpgm Causes wavefront to be terminated.
14258 llvm.debugtrap *none* Compiler warning given that there is no
14259 trap handler installed.
14260 =============== =============== ===========================================
14270 When the language is OpenCL the following differences occur:
14272 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
14273 2. The AMDGPU backend appends additional arguments to the kernel's explicit
14274 arguments for the AMDHSA OS (see
14275 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
14276 3. Additional metadata is generated
14277 (see :ref:`amdgpu-amdhsa-code-object-metadata`).
14279 .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
14280 :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
14282 ======== ==== ========= ===========================================
14283 Position Byte Byte Description
14285 ======== ==== ========= ===========================================
14286 1 8 8 OpenCL Global Offset X
14287 2 8 8 OpenCL Global Offset Y
14288 3 8 8 OpenCL Global Offset Z
14289 4 8 8 OpenCL address of printf buffer
14290 5 8 8 OpenCL address of virtual queue used by
14292 6 8 8 OpenCL address of AqlWrap struct used by
14294 7 8 8 Pointer argument used for Multi-grid
14296 ======== ==== ========= ===========================================
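To show how the sizes and alignments in the table combine, here is a small Python sketch that computes byte offsets for the appended arguments. It assumes the implicit arguments are laid out in position order directly after the explicit arguments, each aligned as listed; that layout rule and the argument names in the list are assumptions made for illustration only.

```python
# (name, byte size, byte alignment) in table order; the names are hypothetical.
IMPLICIT_ARGS = [
    ("global_offset_x", 8, 8),
    ("global_offset_y", 8, 8),
    ("global_offset_z", 8, 8),
    ("printf_buffer", 8, 8),
    ("virtual_queue", 8, 8),
    ("aql_wrap", 8, 8),
    ("multigrid_arg", 8, 8),
]


def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def implicit_arg_offsets(explicit_args_size):
    """Map each implicit argument name to its assumed byte offset."""
    offsets = {}
    offset = explicit_args_size
    for name, size, alignment in IMPLICIT_ARGS:
        offset = align_up(offset, alignment)
        offsets[name] = offset
        offset += size
    return offsets
```

Under these assumptions, a kernel with 12 bytes of explicit arguments would see the first implicit argument at byte offset 16.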
14303 When the language is HCC the following differences occur:
14305 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
14307 .. _amdgpu-assembler:
14312 The AMDGPU backend has an LLVM-MC based assembler, which is currently in development.
14313 It supports AMDGCN GFX6-GFX11.
14315 This section describes general syntax for instructions and operands.
14320 An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
14322 | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
14323 <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
14325 :doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
14326 :doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
14328 The order of operands and modifiers is fixed.
14329 Most modifiers are optional and may be omitted.
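The separation rules above (comma-separated operands, then space-separated modifiers) can be illustrated with a small Python sketch. It is a simplification written for this guide, not the assembler's parser: the function name is hypothetical, and it does not handle instructions that carry modifiers without any operand.

```python
def split_instruction(text):
    """Split 'opcode op0, op1 mod0 mod1' into (opcode, operands, modifiers)."""
    opcode, _, rest = text.strip().partition(" ")
    # Split on commas that are not nested inside [...] so that register
    # ranges such as v[8:9] and lists such as quad_perm:[0,2,1,1] stay whole.
    parts, depth, current = [], 0, ""
    for ch in rest:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current.strip())
            current = ""
        else:
            current += ch
    if current.strip():
        parts.append(current.strip())
    if not parts:
        return opcode, [], []
    # The last comma-separated chunk holds the final operand followed by
    # any space-separated modifiers.
    tail = parts.pop().split()
    return opcode, parts + tail[:1], tail[1:]
```

For instance, ``ds_add_u32 v2, v4 offset:16`` splits into the opcode ``ds_add_u32``, the operands ``v2`` and ``v4``, and the modifier ``offset:16``.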
14331 Links to detailed instruction syntax description may be found in the following
14332 table. Note that features under development are not included
14333 in this description.
14335 ============= ============================================= =======================================
14336 Architecture Core ISA ISA Variants and Extensions
14337 ============= ============================================= =======================================
14338 GCN 2 :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>` \-
14339 GCN 3, GCN 4 :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>` \-
14340 GCN 5 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`
14342 :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`
14344 :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`
14346 :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`
14348 :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`
14350 :doc:`gfx90c<AMDGPU/AMDGPUAsmGFX900>`
14352 CDNA 1 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`
14354 CDNA 2 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`
14356 CDNA 3 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx940<AMDGPU/AMDGPUAsmGFX940>`
14358 RDNA 1 :doc:`GFX10 RDNA1<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1010<AMDGPU/AMDGPUAsmGFX10>`
14360 :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`
14362 :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
14364 :doc:`gfx1013<AMDGPU/AMDGPUAsmGFX1013>`
14366 RDNA 2 :doc:`GFX10 RDNA2<AMDGPU/AMDGPUAsmGFX1030>` :doc:`gfx1030<AMDGPU/AMDGPUAsmGFX1030>`
14368 :doc:`gfx1031<AMDGPU/AMDGPUAsmGFX1030>`
14370 :doc:`gfx1032<AMDGPU/AMDGPUAsmGFX1030>`
14372 :doc:`gfx1033<AMDGPU/AMDGPUAsmGFX1030>`
14374 :doc:`gfx1034<AMDGPU/AMDGPUAsmGFX1030>`
14376 :doc:`gfx1035<AMDGPU/AMDGPUAsmGFX1030>`
14378 :doc:`gfx1036<AMDGPU/AMDGPUAsmGFX1030>`
14380 RDNA 3 :doc:`GFX11<AMDGPU/AMDGPUAsmGFX11>` :doc:`gfx1100<AMDGPU/AMDGPUAsmGFX11>`
14382 :doc:`gfx1101<AMDGPU/AMDGPUAsmGFX11>`
14384 :doc:`gfx1102<AMDGPU/AMDGPUAsmGFX11>`
14386 :doc:`gfx1103<AMDGPU/AMDGPUAsmGFX11>`
14387 ============= ============================================= =======================================
14389 For more information about instructions, their semantics and supported
14390 combinations of operands, refer to one of the instruction set architecture manuals
14391 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
14392 [AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
14393 [AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX90A-CDNA2]_, [AMD-GCN-GFX10-RDNA1]_,
14394 [AMD-GCN-GFX10-RDNA2]_ and [AMD-GCN-GFX11-RDNA3]_.
14399 Detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
14404 Detailed description of modifiers may be found
14405 :doc:`here<AMDGPUModifierSyntax>`.
14407 Instruction Examples
14408 ~~~~~~~~~~~~~~~~~~~~
14413 .. code-block:: nasm
14415 ds_add_u32 v2, v4 offset:16
14416 ds_write_src2_b64 v2 offset0:4 offset1:8
14417 ds_cmpst_f32 v2, v4, v6
14418 ds_min_rtn_f64 v[8:9], v2, v[4:5]
14420 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA
14426 .. code-block:: nasm
14428 flat_load_dword v1, v[3:4]
14429 flat_store_dwordx3 v[3:4], v[5:7]
14430 flat_atomic_swap v1, v[3:4], v5 glc
14431 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
14432 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
14434 For full list of supported instructions, refer to "FLAT instructions" in ISA
14440 .. code-block:: nasm
14442 buffer_load_dword v1, off, s[4:7], s1
14443 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
14444 buffer_store_format_xy v[1:2], off, s[4:7], s1
14446 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
14448 For full list of supported instructions, refer to "MUBUF Instructions" in ISA
14454 .. code-block:: nasm
14456 s_load_dword s1, s[2:3], 0xfc
14457 s_load_dwordx8 s[8:15], s[2:3], s4
14458 s_load_dwordx16 s[88:103], s[2:3], s4
14462 For full list of supported instructions, refer to "Scalar Memory Operations" in
14468 .. code-block:: nasm
14471 s_mov_b64 s[0:1], 0x80000000
14473 s_wqm_b64 s[2:3], s[4:5]
14474 s_bcnt0_i32_b64 s1, s[2:3]
14475 s_swappc_b64 s[2:3], s[4:5]
14476 s_cbranch_join s[4:5]
14478 For full list of supported instructions, refer to "SOP1 Instructions" in ISA
14484 .. code-block:: nasm
14486 s_add_u32 s1, s2, s3
14487 s_and_b64 s[2:3], s[4:5], s[6:7]
14488 s_cselect_b32 s1, s2, s3
14489 s_andn2_b32 s2, s4, s6
14490 s_lshr_b64 s[2:3], s[4:5], s6
14491 s_ashr_i32 s2, s4, s6
14492 s_bfm_b64 s[2:3], s4, s6
14493 s_bfe_i64 s[2:3], s[4:5], s6
14494 s_cbranch_g_fork s[4:5], s[6:7]
14496 For full list of supported instructions, refer to "SOP2 Instructions" in ISA
14502 .. code-block:: nasm
14504 s_cmp_eq_i32 s1, s2
14505 s_bitcmp1_b32 s1, s2
14506 s_bitcmp0_b64 s[2:3], s4
14509 For full list of supported instructions, refer to "SOPC Instructions" in ISA
14515 .. code-block:: nasm
14520 s_waitcnt 0 ; Wait for all counters to be 0
14521 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
14522 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
14526 s_sendmsg sendmsg(MSG_INTERRUPT)
14529 For full list of supported instructions, refer to "SOPP Instructions" in ISA
14532 Unless otherwise mentioned, little verification is performed on the operands
14533 of SOPP Instructions, so it is up to the programmer to be familiar with the
14534 range of acceptable values.
14539 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
14540 the assembler will automatically use optimal encoding based on its operands. To
14541 force specific encoding, one can add a suffix to the opcode of the instruction:
14543 * _e32 for 32-bit VOP1/VOP2/VOPC
14544 * _e64 for 64-bit VOP3
14546 * _sdwa for VOP_SDWA
14548 VOP1/VOP2/VOP3/VOPC examples:
14550 .. code-block:: nasm
14553 v_mov_b32_e32 v1, v2
14555 v_cvt_f64_i32_e32 v[1:2], v2
14556 v_floor_f32_e32 v1, v2
14557 v_bfrev_b32_e32 v1, v2
14558 v_add_f32_e32 v1, v2, v3
14559 v_mul_i32_i24_e64 v1, v2, 3
14560 v_mul_i32_i24_e32 v1, -3, v3
14561 v_mul_i32_i24_e32 v1, -100, v3
14562 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
14563 v_max_f16_e32 v1, v2, v3
14567 .. code-block:: nasm
14569 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
14570 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14571 v_mov_b32 v0, v0 wave_shl:1
14572 v_mov_b32 v0, v0 row_mirror
14573 v_mov_b32 v0, v0 row_bcast:31
14574 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
14575 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14576 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
14580 .. code-block:: nasm
14582 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
14583 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
14584 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
14585 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
14586 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
14588 For full list of supported instructions, refer to "Vector ALU instructions".
14590 .. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
14592 Code Object V2 Predefined Symbols
14593 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14596 Code object V2 is not the default code object version emitted by
14597 this version of LLVM.
14599 The AMDGPU assembler defines and updates some symbols automatically. These
14600 symbols do not affect code generation.
14602 .option.machine_version_major
14603 +++++++++++++++++++++++++++++
14605 Set to the GFX major generation number of the target being assembled for. For
14606 example, when assembling for a "GFX9" target this will be set to the integer
14607 value "9". The possible GFX major generation numbers are presented in
14608 :ref:`amdgpu-processors`.
14610 .option.machine_version_minor
14611 +++++++++++++++++++++++++++++
14613 Set to the GFX minor generation number of the target being assembled for. For
14614 example, when assembling for a "GFX810" target this will be set to the integer
14615 value "1". The possible GFX minor generation numbers are presented in
14616 :ref:`amdgpu-processors`.
14618 .option.machine_version_stepping
14619 ++++++++++++++++++++++++++++++++
14621 Set to the GFX stepping generation number of the target being assembled for.
14622 For example, when assembling for a "GFX704" target this will be set to the
14623 integer value "4". The possible GFX stepping generation numbers are presented
14624 in :ref:`amdgpu-processors`.
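The relationship between a processor name and the three version symbols can be sketched in Python. The sketch assumes that the last two characters of the name are the minor and stepping numbers and that the stepping may be a hexadecimal digit (as in ``gfx90a``); the function name is hypothetical, and :ref:`amdgpu-processors` remains the authoritative list.

```python
def split_gfx_name(name):
    """Return (major, minor, stepping) for a processor name such as 'gfx810'."""
    lowered = name.lower()
    if not lowered.startswith("gfx") or len(lowered) < 6:
        raise ValueError(f"unexpected processor name: {name!r}")
    digits = lowered[3:]
    major = int(digits[:-2])          # everything before the last two characters
    minor = int(digits[-2])           # second-to-last character
    stepping = int(digits[-1], 16)    # last character, possibly a hex digit
    return major, minor, stepping
```

This reproduces the examples in the text: ``GFX810`` has minor number 1 and ``GFX704`` has stepping number 4.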
14629 Set to zero each time a
14630 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
14631 encountered. At each instruction, if the current value of this symbol is less
14632 than or equal to the maximum VGPR number explicitly referenced within that
14633 instruction then the symbol value is updated to equal that VGPR number plus
14639 Set to zero each time a
14640 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
14641 encountered. At each instruction, if the current value of this symbol is less
14642 than or equal to the maximum SGPR number explicitly referenced within that
14643 instruction then the symbol value is updated to equal that SGPR number plus
14646 .. _amdgpu-amdhsa-assembler-directives-v2:
14648 Code Object V2 Directives
14649 ~~~~~~~~~~~~~~~~~~~~~~~~~
14652 Code object V2 is not the default code object version emitted by
14653 this version of LLVM.
14655 The AMDGPU ABI defines auxiliary data in the output code object. In assembly source,
14656 this data can be specified with assembler directives.
14658 .hsa_code_object_version major, minor
14659 +++++++++++++++++++++++++++++++++++++
14661 *major* and *minor* are integers that specify the version of the HSA code
14662 object that will be generated by the assembler.
14664 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
14665 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
14668 *major*, *minor*, and *stepping* are all integers that describe the instruction
14669 set architecture (ISA) version of the assembly program.
14671 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
14672 "AMD" and *arch* should always be equal to "AMDGPU".
14674 By default, the assembler will derive the ISA version, *vendor*, and *arch*
14675 from the value of the -mcpu option that is passed to the assembler.
14677 .. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
14679 .amdgpu_hsa_kernel (name)
14680 +++++++++++++++++++++++++
14682 This directive specifies that the symbol with given name is a kernel entry
14683 point (label) and the object should contain corresponding symbol of type
14684 STT_AMDGPU_HSA_KERNEL.
14689 This directive marks the beginning of a list of key / value pairs that are used
14690 to specify the amd_kernel_code_t object that will be emitted by the assembler.
14691 The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
14692 amd_kernel_code_t values that are unspecified a default value will be used. The
14693 default value for all keys is 0, with the following exceptions:
14695 - *amd_kernel_code_version_major* defaults to 1.
14696 - *amd_kernel_code_version_minor* defaults to 2.
14697 - *amd_machine_kind* defaults to 1.
14698 - *amd_machine_version_major*, *amd_machine_version_minor*, and
14699 *amd_machine_version_stepping* are derived from the value of the -mcpu option
14700 that is passed to the assembler.
14701 - *kernel_code_entry_byte_offset* defaults to 256.
14702 - *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10 onwards
14703 defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5.
14704 Note that wavefront size is specified as a power of two, so a value of **n**
14705 means a size of 2^ **n**.
14706 - *call_convention* defaults to -1.
14707 - *kernarg_segment_alignment*, *group_segment_alignment*, and
14708 *private_segment_alignment* default to 4. Note that alignments are specified
14709 as a power of 2, so a value of **n** means an alignment of 2^ **n**.
14710 - *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
14712 - *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
14714 - *enable_mem_ordered* defaults to 1 for GFX10 onwards.
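Because several of these keys are stored as powers of two, a short Python sketch may help when reading or writing their values. The function names are hypothetical; the defaults mirror the list above.

```python
def decode_log2_field(n):
    """Convert a power-of-two field value n into the size or alignment 2^n."""
    return 1 << n


def default_wavefront_size_log2(gfx_major, wavefrontsize64=False):
    """Default wavefront_size key value for a given GFX major generation."""
    # Targets before GFX10 always default to 6 (a wavefront of 64).
    if gfx_major < 10:
        return 6
    # GFX10 onwards depends on the wavefrontsize64 target feature.
    return 6 if wavefrontsize64 else 5
```

For example, the default *kernarg_segment_alignment* value of 4 corresponds to a 16-byte alignment, and the default GFX10 *wavefront_size* of 5 corresponds to 32 work-items.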
14716 The *.amd_kernel_code_t* directive must be placed immediately after the
14717 function label and before any instructions.
14719 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
14720 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
14722 .. _amdgpu-amdhsa-assembler-example-v2:
14724 Code Object V2 Example Source Code
14725 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14728 Code Object V2 is not the default code object version emitted by
14729 this version of LLVM.
14731 Here is an example of a minimal assembly source file, defining one HSA kernel:
14736 .hsa_code_object_version 1,0
14737 .hsa_code_object_isa
14742 .amdgpu_hsa_kernel hello_world
14747 enable_sgpr_kernarg_segment_ptr = 1
14749 compute_pgm_rsrc1_vgprs = 0
14750 compute_pgm_rsrc1_sgprs = 0
14751 compute_pgm_rsrc2_user_sgpr = 2
14752 compute_pgm_rsrc1_wgp_mode = 0
14753 compute_pgm_rsrc1_mem_ordered = 0
14754 compute_pgm_rsrc1_fwd_progress = 1
14755 .end_amd_kernel_code_t
14757 s_load_dwordx2 s[0:1], s[0:1] 0x0
14758 v_mov_b32 v0, 3.14159
14759 s_waitcnt lgkmcnt(0)
14762 flat_store_dword v[1:2], v0
14765 .size hello_world, .Lfunc_end0-hello_world
14767 .. _amdgpu-amdhsa-assembler-predefined-symbols-v3-onwards:
14769 Code Object V3 and Above Predefined Symbols
14770 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14772 The AMDGPU assembler defines and updates some symbols automatically. These
14773 symbols do not affect code generation.
14775 .amdgcn.gfx_generation_number
14776 +++++++++++++++++++++++++++++
14778 Set to the GFX major generation number of the target being assembled for. For
14779 example, when assembling for a "GFX9" target this will be set to the integer
14780 value "9". The possible GFX major generation numbers are presented in
14781 :ref:`amdgpu-processors`.
14783 .amdgcn.gfx_generation_minor
14784 ++++++++++++++++++++++++++++
14786 Set to the GFX minor generation number of the target being assembled for. For
14787 example, when assembling for a "GFX810" target this will be set to the integer
14788 value "1". The possible GFX minor generation numbers are presented in
14789 :ref:`amdgpu-processors`.
14791 .amdgcn.gfx_generation_stepping
14792 +++++++++++++++++++++++++++++++
14794 Set to the GFX stepping generation number of the target being assembled for.
14795 For example, when assembling for a "GFX704" target this will be set to the
14796 integer value "4". The possible GFX stepping generation numbers are presented
14797 in :ref:`amdgpu-processors`.
14799 .. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
14801 .amdgcn.next_free_vgpr
14802 ++++++++++++++++++++++
14804 Set to zero before assembly begins. At each instruction, if the current value
14805 of this symbol is less than or equal to the maximum VGPR number explicitly
14806 referenced within that instruction then the symbol value is updated to equal
14807 that VGPR number plus one.
14809 May be used to set the `.amdhsa_next_free_vgpr` directive in
14810 :ref:`amdhsa-kernel-directives-table`.
14812 May be set at any time, e.g. manually set to zero at the start of each kernel.
14814 .. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
14816 .amdgcn.next_free_sgpr
14817 ++++++++++++++++++++++
14819 Set to zero before assembly begins. At each instruction, if the current value
14820 of this symbol is less than or equal to the maximum SGPR number explicitly
14821 referenced within that instruction then the symbol value is updated to equal
14822 that SGPR number plus one.
14824 May be used to set the `.amdhsa_next_free_sgpr` directive in
14825 :ref:`amdhsa-kernel-directives-table`.
14827 May be set at any time, e.g. manually set to zero at the start of each kernel.
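The update rule for the two ``next_free`` symbols can be modelled with a short Python sketch. It is an approximation for illustration only: the function name is hypothetical, and the regular expression recognises only plain ``vN``/``sN`` registers and ``v[A:B]``/``s[A:B]`` ranges, not every register syntax the assembler accepts.

```python
import re

# Matches v2, s4, v[8:9], s[2:3]; groups are (kind, single, range_lo, range_hi).
_REG = re.compile(r"\b([vs])(?:(\d+)|\[(\d+):(\d+)\])")


def next_free_registers(instructions):
    """Track the values the next_free_vgpr/sgpr symbols would reach."""
    next_free = {"v": 0, "s": 0}  # both symbols start at zero
    for inst in instructions:
        # Skip the opcode and scan only the operands and modifiers.
        _, _, operands = inst.strip().partition(" ")
        for kind, single, _lo, hi in _REG.findall(operands):
            highest = int(single) if single else int(hi)
            # Update only when the symbol is <= the referenced register number.
            if next_free[kind] <= highest:
                next_free[kind] = highest + 1
    return next_free
```

Feeding it ``ds_min_rtn_f64 v[8:9], v2, v[4:5]`` leaves the VGPR counter at 10, since ``v9`` is the highest VGPR referenced.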
14829 .. _amdgpu-amdhsa-assembler-directives-v3-onwards:
14831 Code Object V3 and Above Directives
14832 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
14834 Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
14835 architecture processors, and are not OS-specific. Directives which begin with
14836 ``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
14837 ``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
14838 :ref:`amdgpu-processors`.
14840 .. _amdgpu-assembler-directive-amdgcn-target:
14842 .amdgcn_target <target-triple> "-" <target-id>
14843 ++++++++++++++++++++++++++++++++++++++++++++++
14845 Optional directive which declares the ``<target-triple>-<target-id>`` supported
14846 by the containing assembler source file. Used by the assembler to validate
14847 command-line options such as ``-triple``, ``-mcpu``, and
14848 ``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
14849 :ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.
14853 The target ID syntax used for code object V2 to V3 for this directive differs
14854 from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
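As a rough illustration of the ``<target-triple>-<target-id>`` layout, the following Python sketch splits a directive argument into its parts. It assumes a four-component triple (architecture, vendor, OS, environment) in which the environment may be empty; the function name is hypothetical, and the target ID is returned unparsed because its syntax depends on the code object version.

```python
def parse_amdgcn_target(value):
    """Split '<arch>-<vendor>-<os>-<environment>-<target-id>' into fields."""
    # Split at most four times so that any '-' inside the target ID is kept.
    # Unpacking raises ValueError if there are fewer than five components.
    arch, vendor, os_name, environment, target_id = value.split("-", 4)
    return {
        "arch": arch,
        "vendor": vendor,
        "os": os_name,
        "environment": environment,
        "target_id": target_id,
    }
```

For ``amdgcn-amd-amdhsa--gfx900+xnack`` the environment is empty and the target ID is ``gfx900+xnack``.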
14856 .amdhsa_kernel <name>
14857 +++++++++++++++++++++
14859 Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
14860 ``<name>.kd``, in the current location of the current section. Only valid when
14861 the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
14862 instruction to execute, and does not need to be previously defined.
14864 Marks the beginning of a list of directives used to generate the bytes of a
14865 kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
14866 Directives which may appear in this list are described in
14867 :ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
14868 be valid for the target being assembled for, and cannot be repeated. Directives
14869 support the range of values specified by the field they reference in
14870 :ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
14871 assumed to have its default value, unless it is marked as "Required", in which
14872 case it is an error to omit the directive. This list of directives is
14873 terminated by an ``.end_amdhsa_kernel`` directive.
14875 .. table:: AMDHSA Kernel Assembler Directives
14876 :name: amdhsa-kernel-directives-table
14878 ======================================================== =================== ============ ===================
14879 Directive Default Supported On Description
14880 ======================================================== =================== ============ ===================
14881 ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX11 Controls GROUP_SEGMENT_FIXED_SIZE in
14882 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14883 ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX11 Controls PRIVATE_SEGMENT_FIXED_SIZE in
14884 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14885 ``.amdhsa_kernarg_size`` 0 GFX6-GFX11 Controls KERNARG_SIZE in
14886 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14887 ``.amdhsa_user_sgpr_count`` 0 GFX6-GFX11 Controls USER_SGPR_COUNT in COMPUTE_PGM_RSRC2
14888 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`
14889 ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
14890 (except :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14892 ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX11 Controls ENABLE_SGPR_DISPATCH_PTR in
14893 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14894 ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX11 Controls ENABLE_SGPR_QUEUE_PTR in
14895 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14896 ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX11 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
14897 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14898 ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX11 Controls ENABLE_SGPR_DISPATCH_ID in
14899 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14900 ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX10 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
14901 (except :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14903 ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX11 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
14904 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14905 ``.amdhsa_wavefront_size32`` Target GFX10-GFX11 Controls ENABLE_WAVEFRONT_SIZE32 in
14906 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14909 ``.amdhsa_uses_dynamic_stack`` 0 GFX6-GFX11 Controls USES_DYNAMIC_STACK in
14910 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14911 ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX10 Controls ENABLE_PRIVATE_SEGMENT in
14912 (except :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14914 ``.amdhsa_enable_private_segment`` 0 GFX940, Controls ENABLE_PRIVATE_SEGMENT in
14915 GFX11 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14916 ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_ID_X in
14917 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14918 ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_ID_Y in
14919 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14920 ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_ID_Z in
14921 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14922 ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX11 Controls ENABLE_SGPR_WORKGROUP_INFO in
14923 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14924 ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX11 Controls ENABLE_VGPR_WORKITEM_ID in
14925 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14926 Possible values are defined in
14927 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
14928 ``.amdhsa_next_free_vgpr`` Required GFX6-GFX11 Maximum VGPR number explicitly referenced, plus one.
14929 Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
14930 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14931 ``.amdhsa_next_free_sgpr`` Required GFX6-GFX11 Maximum SGPR number explicitly referenced, plus one.
14932 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14933 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14934 ``.amdhsa_accum_offset`` Required GFX90A, Offset of a first AccVGPR in the unified register file.
14935 GFX940 Used to calculate ACCUM_OFFSET in
14936 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
14937 ``.amdhsa_reserve_vcc`` 1 GFX6-GFX11 Whether the kernel may use the special VCC SGPR.
14938 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14939 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14940 ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX10 Whether the kernel may use flat instructions to access
14941 (except scratch memory. Used to calculate
14942 GFX940) GRANULATED_WAVEFRONT_SGPR_COUNT in
14943 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14944 ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay.
14945 Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
14946 Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14948 ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX11 Controls FLOAT_ROUND_MODE_32 in
14949 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14950 Possible values are defined in
14951 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
14952 ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX11 Controls FLOAT_ROUND_MODE_16_64 in
14953 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14954 Possible values are defined in
14955 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
14956 ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX11 Controls FLOAT_DENORM_MODE_32 in
14957 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14958 Possible values are defined in
14959 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
14960 ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX11 Controls FLOAT_DENORM_MODE_16_64 in
14961 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14962 Possible values are defined in
14963 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
14964 ``.amdhsa_dx10_clamp`` 1 GFX6-GFX11 Controls ENABLE_DX10_CLAMP in
14965 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14966 ``.amdhsa_ieee_mode`` 1 GFX6-GFX11 Controls ENABLE_IEEE_MODE in
14967 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14968 ``.amdhsa_fp16_overflow`` 0 GFX9-GFX11 Controls FP16_OVFL in
14969 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14970 ``.amdhsa_tg_split`` Target GFX90A, Controls TG_SPLIT in
14971 Feature GFX940, :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
14974 ``.amdhsa_workgroup_processor_mode`` Target GFX10-GFX11 Controls ENABLE_WGP_MODE in
14975 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
14978 ``.amdhsa_memory_ordered`` 1 GFX10-GFX11 Controls MEM_ORDERED in
14979 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14980 ``.amdhsa_forward_progress`` 0 GFX10-GFX11 Controls FWD_PROGRESS in
14981 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx11-table`.
14982 ``.amdhsa_shared_vgpr_count`` 0 GFX10-GFX11 Controls SHARED_VGPR_COUNT in
14983 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-gfx11-table`.
14984 ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
14985 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14986 ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
14987 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14988 ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
14989 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14990 ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
14991 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14992 ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
14993 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14994 ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
14995 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14996 ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX11 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
14997 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx11-table`.
14998 ======================================================== =================== ============ ===================
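
The default column of the table above can be read as "the value the assembler uses when the directive is omitted". A minimal Python sketch of that reading follows. It is illustrative only: the helper names ``effective_settings`` and ``render_overrides`` are hypothetical, only rows with a literal numeric default visible in this part of the table are listed, and the GFX applicability ranges are not modeled.

```python
# Defaults copied from the directive table; rows whose default depends on a
# target feature (for example .amdhsa_tg_split) are deliberately left out.
AMDHSA_DEFAULTS = {
    ".amdhsa_float_denorm_mode_32": 0,
    ".amdhsa_float_denorm_mode_16_64": 3,
    ".amdhsa_dx10_clamp": 1,
    ".amdhsa_ieee_mode": 1,
    ".amdhsa_fp16_overflow": 0,
    ".amdhsa_memory_ordered": 1,
    ".amdhsa_forward_progress": 0,
    ".amdhsa_shared_vgpr_count": 0,
    ".amdhsa_exception_fp_ieee_invalid_op": 0,
    ".amdhsa_exception_fp_denorm_src": 0,
    ".amdhsa_exception_fp_ieee_div_zero": 0,
    ".amdhsa_exception_fp_ieee_overflow": 0,
    ".amdhsa_exception_fp_ieee_underflow": 0,
    ".amdhsa_exception_fp_ieee_inexact": 0,
    ".amdhsa_exception_int_div_zero": 0,
}


def effective_settings(overrides):
    """Merge explicit directive values over the table defaults."""
    unknown = sorted(set(overrides) - set(AMDHSA_DEFAULTS))
    if unknown:
        raise KeyError(f"directives not covered by this sketch: {unknown}")
    return {**AMDHSA_DEFAULTS, **overrides}


def render_overrides(overrides):
    """Return assembly lines only for directives that differ from the default."""
    settings = effective_settings(overrides)
    return [
        f"  {name} {value}"
        for name, value in settings.items()
        if value != AMDHSA_DEFAULTS[name]
    ]


# Hypothetical kernel that disables IEEE mode and keeps everything else.
lines = render_overrides({".amdhsa_ieee_mode": 0, ".amdhsa_dx10_clamp": 1})
print("\n".join(lines))
```

Only ``.amdhsa_ieee_mode 0`` is printed here, because ``.amdhsa_dx10_clamp 1`` restates the default and would be redundant inside an ``.amdhsa_kernel`` block.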
15003 Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
15004 note record (see :ref:`amdgpu-elf-note-records-table-v3-onwards`).
15006 The contents must be in the [YAML]_ markup format, with the same structure and
15007 semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`,
15008 :ref:`amdgpu-amdhsa-code-object-metadata-v4` or
15009 :ref:`amdgpu-amdhsa-code-object-metadata-v5`.
15011 This directive is terminated by an ``.end_amdgpu_metadata`` directive.
15013 .. _amdgpu-amdhsa-assembler-example-v3-onwards:
15015 Code Object V3 and Above Example Source Code
15016 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15018 Here is an example of a minimal assembly source file, defining one HSA kernel:
15023 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
15028 .type hello_world,@function
15030 s_load_dwordx2 s[0:1], s[0:1], 0x0
15031 v_mov_b32 v0, 3.14159
15032 s_waitcnt lgkmcnt(0)
15035 flat_store_dword v[1:2], v0
15038 .size hello_world, .Lfunc_end0-hello_world
15042 .amdhsa_kernel hello_world
15043 .amdhsa_user_sgpr_kernarg_segment_ptr 1
15044 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15045 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15054 - .name: hello_world
15055 .symbol: hello_world.kd
15056 .kernarg_segment_size: 48
15057 .group_segment_fixed_size: 0
15058 .private_segment_fixed_size: 0
15059 .kernarg_segment_align: 4
15060 .wavefront_size: 64
15063 .max_flat_workgroup_size: 256
15067 .value_kind: global_buffer
15068 .address_space: global
15069 .actual_access: write_only
15071 .end_amdgpu_metadata
15073 This kernel is equivalent to the following HIP program:
15078 __global__ void hello_world(float *p) {
15082 If an assembly source file contains multiple kernels and/or functions, the
15083 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
15084 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
15085 the ``.set <symbol>, <expression>`` directive. For example, in the case of two
15086 kernels, where ``func1`` is only called from ``kern1``, it is sufficient
15087 to group the function with the kernel that calls it and reset the symbols
15088 between the two connected components:
15093 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
15095 // gpr tracking symbols are implicitly set to zero
15100 .type kern0,@function
15105 .size kern0, .Lkern0_end-kern0
15109 .amdhsa_kernel kern0
15111 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15112 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15115 // reset symbols to begin tracking usage in func1 and kern1
15116 .set .amdgcn.next_free_vgpr, 0
15117 .set .amdgcn.next_free_sgpr, 0
15123 .type func1,@function
15126 s_setpc_b64 s[30:31]
15128 .size func1, .Lfunc1_end-func1
15132 .type kern1,@function
15136 s_add_u32 s4, s4, func1@rel32@lo+4
15137 s_addc_u32 s5, s5, func1@rel32@hi+12
15138 s_swappc_b64 s[30:31], s[4:5]
15142 .size kern1, .Lkern1_end-kern1
15146 .amdhsa_kernel kern1
15148 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
15149 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
15152 The assembler cannot identify connected components, so these symbols do not
15153 automatically track the usage of each kernel. However, in some cases, grouping
15154 each kernel with the functions it calls and resetting the symbols between groups
15155 means little additional effort is required to accurately calculate GPR usage.
15157 Additional Documentation
15158 ========================
15160 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
15161 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
15162 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
15163 .. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
15164 .. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
15165 .. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
15166 .. [AMD-GCN-GFX90A-CDNA2] `AMD Instinct MI200 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf>`__
15167 .. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
15168 .. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
15169 .. [AMD-GCN-GFX11-RDNA3] `AMD RDNA 3 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf>`__
15170 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
15171 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
15172 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
15173 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
15174 .. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
15175 .. [AMD-ROCm-github] `AMD ROCm™ GitHub <http://github.com/RadeonOpenCompute>`__
15176 .. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
15177 .. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
15178 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
15179 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
15180 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
15181 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
15182 .. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
15183 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
15184 .. [SEMVER] `Semantic Versioning <https://semver.org/>`__
15185 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__