1 =============================
2 User Guide for AMDGPU Backend
3 =============================
14 AMDGPU/AMDGPUAsmGFX900
15 AMDGPU/AMDGPUAsmGFX904
16 AMDGPU/AMDGPUAsmGFX906
17 AMDGPU/AMDGPUAsmGFX908
18 AMDGPU/AMDGPUAsmGFX90a
20 AMDGPU/AMDGPUAsmGFX1011
23 AMDGPUInstructionSyntax
24 AMDGPUInstructionNotation
25 AMDGPUDwarfExtensionsForHeterogeneousDebugging
26 AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack/AMDGPUDwarfExtensionAllowLocationDescriptionOnTheDwarfExpressionStack
The AMDGPU backend provides ISA code generation for AMD GPUs, from the R600
family through the current GCN families. It lives in the
``llvm/lib/Target/AMDGPU`` directory.
38 .. _amdgpu-target-triples:
43 Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
44 to specify the target triple:
.. table:: AMDGPU Architectures
   :name: amdgpu-architecture-table

   ============ ==============================================================
   Architecture Description
   ============ ==============================================================
   ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
   ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
   ============ ==============================================================
.. table:: AMDGPU Vendors
   :name: amdgpu-vendor-table

   ============ ==============================================================
   Vendor       Description
   ============ ==============================================================
   ``amd``      Can be used for all AMD GPU usage.
   ``mesa3d``   Can be used if the OS is ``mesa3d``.
   ============ ==============================================================
66 .. table:: AMDGPU Operating Systems
69 ============== ============================================================
71 ============== ============================================================
72 *<empty>* Defaults to the *unknown* OS.
73 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
76 - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
77 loader on Linux. See *AMD ROCm Platform Release Notes*
78 [AMD-ROCm-Release-Notes]_ for supported hardware and
80 - AMD's PAL runtime using the *pal-amdhsa* loader on
83 ``amdpal`` Graphic shaders and compute kernels executed on AMD's PAL
84 runtime using the *pal-amdpal* loader on Windows and Linux
86 ``mesa3d`` Graphic shaders and compute kernels executed on AMD's Mesa
87 3D runtime using the *mesa-mesa3d* loader on Linux.
88 ============== ============================================================
90 .. table:: AMDGPU Environments
91 :name: amdgpu-environment-table
93 ============ ==============================================================
94 Environment Description
95 ============ ==============================================================
97 ============ ==============================================================
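The triple components listed in the tables above can be combined mechanically. The following Python sketch is only an illustration and is not part of LLVM or Clang: the helper name ``make_triple`` is hypothetical, it knows only the values shown in the tables above, and it assumes that trailing empty components are simply omitted from the triple.

```python
# Values taken from the architecture, vendor, and OS tables above.
ARCHITECTURES = {"r600", "amdgcn"}
VENDORS = {"amd", "mesa3d"}
OPERATING_SYSTEMS = {"", "amdhsa", "amdpal", "mesa3d"}  # "" is the unknown OS


def make_triple(arch, vendor, os="", environment=""):
    """Hypothetical helper: validate the components and join them."""
    if arch not in ARCHITECTURES:
        raise ValueError(f"unknown architecture: {arch!r}")
    if vendor not in VENDORS:
        raise ValueError(f"unknown vendor: {vendor!r}")
    if os not in OPERATING_SYSTEMS:
        raise ValueError(f"unknown OS: {os!r}")
    if arch == "r600" and os == "amdhsa":
        raise ValueError("amdhsa is not supported on the r600 architecture")
    if vendor == "mesa3d" and os != "mesa3d":
        raise ValueError("the mesa3d vendor is only used with the mesa3d OS")
    parts = [arch, vendor, os, environment]
    # Assumption: drop trailing empty components instead of emitting dashes.
    while parts and parts[-1] == "":
        parts.pop()
    return "-".join(parts)


print(make_triple("amdgcn", "amd", "amdhsa"))
```

The resulting string is the value that would be passed to the Clang ``-target`` option.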
99 .. _amdgpu-processors:
104 Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
105 specify the AMDGPU processor together with optional target features. See
106 :ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU target
107 specific information.
109 Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following exceptions:
111 * ``amdhsa`` is not supported in ``r600`` architecture (see :ref:`amdgpu-architecture-table`).
114 .. table:: AMDGPU Processors
115 :name: amdgpu-processor-table
117 =========== =============== ============ ===== ================= =============== =============== ======================
118 Processor Alternative Target dGPU/ Target Target OS Support Example
119 Processor Triple APU Features Properties *(see* Products
120 Architecture Supported `amdgpu-os`_
129 =========== =============== ============ ===== ================= =============== =============== ======================
130 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
131 -----------------------------------------------------------------------------------------------------------------------
132 ``r600`` ``r600`` dGPU - Does not
137 ``r630`` ``r600`` dGPU - Does not
142 ``rs880`` ``r600`` dGPU - Does not
147 ``rv670`` ``r600`` dGPU - Does not
152 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
153 -----------------------------------------------------------------------------------------------------------------------
154 ``rv710`` ``r600`` dGPU - Does not
159 ``rv730`` ``r600`` dGPU - Does not
164 ``rv770`` ``r600`` dGPU - Does not
169 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
170 -----------------------------------------------------------------------------------------------------------------------
171 ``cedar`` ``r600`` dGPU - Does not
176 ``cypress`` ``r600`` dGPU - Does not
181 ``juniper`` ``r600`` dGPU - Does not
186 ``redwood`` ``r600`` dGPU - Does not
191 ``sumo`` ``r600`` dGPU - Does not
196 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
197 -----------------------------------------------------------------------------------------------------------------------
198 ``barts`` ``r600`` dGPU - Does not
203 ``caicos`` ``r600`` dGPU - Does not
208 ``cayman`` ``r600`` dGPU - Does not
213 ``turks`` ``r600`` dGPU - Does not
218 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
219 -----------------------------------------------------------------------------------------------------------------------
220 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
225 ``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
230 ``gfx602`` - ``hainan`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
235 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
236 -----------------------------------------------------------------------------------------------------------------------
237 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - Offset - *rocm-amdhsa* - A6-7000
238 flat - *pal-amdhsa* - A6 Pro-7050B
239 scratch - *pal-amdpal* - A8-7100
247 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - FirePro W8100
248 flat - *pal-amdhsa* - FirePro W9100
249 scratch - *pal-amdpal* - FirePro S9150
251 ``gfx702`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon R9 290
252 flat - *pal-amdhsa* - Radeon R9 290x
253 scratch - *pal-amdpal* - Radeon R390
255 ``gfx703`` - ``kabini`` ``amdgcn`` APU - Offset - *pal-amdhsa* - E1-2100
256 - ``mullins`` flat - *pal-amdpal* - E1-2200
264 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Offset - *pal-amdhsa* - Radeon HD 7790
265 flat - *pal-amdpal* - Radeon HD 8770
268 ``gfx705`` ``amdgcn`` APU - Offset - *pal-amdhsa* *TBA*
275 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
276 -----------------------------------------------------------------------------------------------------------------------
277 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - Offset - *rocm-amdhsa* - A6-8500P
278 flat - *pal-amdhsa* - Pro A6-8500B
279 scratch - *pal-amdpal* - A8-8600P
295 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon R9 285
296 - ``tonga`` flat - *pal-amdhsa* - Radeon R9 380
297 scratch - *pal-amdpal* - Radeon R9 385
298 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - *rocm-amdhsa* - Radeon R9 Nano
299 - *pal-amdhsa* - Radeon R9 Fury
300 - *pal-amdpal* - Radeon R9 FuryX
303 - Radeon Instinct MI8
304 \ - ``polaris10`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon RX 470
305 flat - *pal-amdhsa* - Radeon RX 480
306 scratch - *pal-amdpal* - Radeon Instinct MI6
307 \ - ``polaris11`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon RX 460
309 scratch - *pal-amdpal*
310 ``gfx805`` - ``tongapro`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - FirePro S7150
311 flat - *pal-amdhsa* - FirePro S7100
312 scratch - *pal-amdpal* - FirePro W7100
315 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack - Offset - *rocm-amdhsa* *TBA*
317 scratch - *pal-amdpal* .. TODO::
322 **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_
323 -----------------------------------------------------------------------------------------------------------------------
324 ``gfx900`` ``amdgcn`` dGPU - xnack - Absolute - *rocm-amdhsa* - Radeon Vega
325 flat - *pal-amdhsa* Frontier Edition
326 scratch - *pal-amdpal* - Radeon RX Vega 56
330 - Radeon Instinct MI25
331 ``gfx902`` ``amdgcn`` APU - xnack - Absolute - *rocm-amdhsa* - Ryzen 3 2200G
332 flat - *pal-amdhsa* - Ryzen 5 2400G
333 scratch - *pal-amdpal*
334 ``gfx904`` ``amdgcn`` dGPU - xnack - *rocm-amdhsa* *TBA*
336 - *pal-amdpal* .. TODO::
341 ``gfx906`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* - Radeon Instinct MI50
342 - xnack flat - *pal-amdhsa* - Radeon Instinct MI60
343 scratch - *pal-amdpal* - Radeon VII
345 ``gfx908`` ``amdgcn`` dGPU - sramecc - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
349 ``gfx909`` ``amdgcn`` APU - xnack - Absolute - *pal-amdpal* *TBA*
356 ``gfx90a`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* *TBA*
358 - xnack scratch .. TODO::
360 work-item Add product
363 ``gfx90c`` ``amdgcn`` APU - xnack - Absolute - *pal-amdpal* - Ryzen 7 4700G
364 flat - Ryzen 7 4700GE
365 scratch - Ryzen 5 4600G
377 **GCN GFX10.1 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
378 -----------------------------------------------------------------------------------------------------------------------
379 ``gfx1010`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 5700
380 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 5700 XT
381 - xnack scratch - *pal-amdpal* - Radeon Pro 5600 XT
383 ``gfx1011`` ``amdgcn`` dGPU - cumode - *rocm-amdhsa* - Radeon Pro V520
384 - wavefrontsize64 - Absolute - *pal-amdhsa*
385 - xnack flat - *pal-amdpal*
387 ``gfx1012`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 5500
388 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 5500 XT
389 - xnack scratch - *pal-amdpal*
390 ``gfx1013`` ``amdgcn`` APU - cumode - Absolute - *rocm-amdhsa* *TBA*
391 - wavefrontsize64 flat - *pal-amdhsa*
392 - xnack scratch - *pal-amdpal* .. TODO::
397 **GCN GFX10.3 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
398 -----------------------------------------------------------------------------------------------------------------------
399 ``gfx1030`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 6800
400 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 6800 XT
401 scratch - *pal-amdpal* - Radeon RX 6900 XT
402 ``gfx1031`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 6700 XT
403 - wavefrontsize64 flat - *pal-amdhsa*
404 scratch - *pal-amdpal*
405 ``gfx1032`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* *TBA*
406 - wavefrontsize64 flat - *pal-amdhsa*
407 scratch - *pal-amdpal* .. TODO::
412 ``gfx1033`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
413 - wavefrontsize64 flat
418 ``gfx1034`` ``amdgcn`` dGPU - cumode - Absolute - *pal-amdpal* *TBA*
419 - wavefrontsize64 flat
425 ``gfx1035`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
426 - wavefrontsize64 flat
431 =========== =============== ============ ===== ================= =============== =============== ======================
433 .. _amdgpu-target-features:
438 Target features control how code is generated to support certain
439 processor specific features. Not all target features are supported by
440 all processors. The runtime must ensure that the features supported by
441 the device used to execute the code match the features enabled when
442 generating the code. A mismatch of features may result in incorrect
443 execution, or a reduction in performance.
The target features supported by each processor are listed in
:ref:`amdgpu-processor-table`.
Target features are controlled by exactly one of the following Clang
options:

``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``

  The ``-mcpu`` and ``--offload-arch`` options can specify target features as
  optional components of the target ID. If omitted, the target feature has the
  ``any`` value. See :ref:`amdgpu-target-id`.
457 ``-m[no-]<target-feature>``
459 Target features not specified by the target ID are specified using a
460 separate option. These target features can have an ``on`` or ``off``
461 value. ``on`` is specified by omitting the ``no-`` prefix, and
462 ``off`` is specified by including the ``no-`` prefix. The default
463 if not specified is ``off``.
For example:

``-mcpu=gfx908:xnack+``
  Enable the ``xnack`` feature.
``-mcpu=gfx908:xnack-``
  Disable the ``xnack`` feature.
``-mcumode``
  Enable the ``cumode`` feature.
``-mno-cumode``
  Disable the ``cumode`` feature.
476 .. table:: AMDGPU Target Features
477 :name: amdgpu-target-features-table
479 =============== ============================ ==================================================
480 Target Feature Clang Option to Control Description
482 =============== ============================ ==================================================
483 cumode - ``-m[no-]cumode`` Control the wavefront execution mode used
484 when generating code for kernels. When disabled
485 native WGP wavefront execution mode is used,
486 when enabled CU wavefront execution mode is used
487 (see :ref:`amdgpu-amdhsa-memory-model`).
489 sramecc - ``-mcpu`` If specified, generate code that can only be
490 - ``--offload-arch`` loaded and executed in a process that has a
491 matching setting for SRAMECC.
493 If not specified for code object V2 to V3, generate
494 code that can be loaded and executed in a process
495 with SRAMECC enabled.
497 If not specified for code object V4, generate
498 code that can be loaded and executed in a process
499 with either setting of SRAMECC.
501 tgsplit ``-m[no-]tgsplit`` Enable/disable generating code that assumes
502 work-groups are launched in threadgroup split mode.
503 When enabled the waves of a work-group may be
504 launched in different CUs.
506 wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
507 generating code for kernels. When disabled
508 native wavefront size 32 is used, when enabled
509 wavefront size 64 is used.
511 xnack - ``-mcpu`` If specified, generate code that can only be
512 - ``--offload-arch`` loaded and executed in a process that has a
513 matching setting for XNACK replay.
515 If not specified for code object V2 to V3, generate
516 code that can be loaded and executed in a process
517 with XNACK replay enabled.
519 If not specified for code object V4, generate
520 code that can be loaded and executed in a process
521 with either setting of XNACK replay.
523 XNACK replay can be used for demand paging and
524 page migration. If enabled in the device, then if
525 a page fault occurs the code may execute
526 incorrectly unless generated with XNACK replay
527 enabled, or generated for code object V4 without
528 specifying XNACK replay. Executing code that was
529 generated with XNACK replay enabled, or generated
530 for code object V4 without specifying XNACK replay,
531 on a device that does not have XNACK replay
532 enabled will execute correctly but may be less
533 performant than code generated for XNACK replay
535 =============== ============================ ==================================================
537 .. _amdgpu-target-id:
542 AMDGPU supports target IDs. See `Clang Offload Bundler
543 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
544 description. The AMDGPU target specific information is:
Is an AMDGPU processor or alternative processor name specified in
:ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
the primary processor and alternative processor names. The canonical form
target ID only allows the primary processor name.
Is a target feature name specified in :ref:`amdgpu-target-features-table` that
is supported by the processor. The target features supported by each processor
are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
a target ID are marked as being controlled by ``-mcpu`` and
``--offload-arch``. Each target feature must appear at most once in a target
ID. The non-canonical form target ID allows the target features to be
specified in any order. The canonical form target ID requires the target
features to be specified in alphabetic order.
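The canonical form rules above (primary processor name, each feature at most once, features in alphabetic order) can be sketched in Python. This is an illustration only, not the Clang implementation: the function name ``canonicalize_target_id`` is hypothetical, the alias map holds just a few entries copied from :ref:`amdgpu-processor-table`, and the sketch does not check whether a feature is supported by the given processor.

```python
# A small subset of alternative names from the processor table (illustrative).
ALTERNATIVE_NAMES = {
    "tahiti": "gfx600",
    "kaveri": "gfx700",
    "hawaii": "gfx701",
    "carrizo": "gfx801",
    "fiji": "gfx803",
}
# Features marked as controlled by -mcpu / --offload-arch.
TARGET_ID_FEATURES = {"sramecc", "xnack"}


def canonicalize_target_id(target_id):
    """Hypothetical helper: rewrite a target ID into canonical form."""
    processor, *features = target_id.split(":")
    processor = ALTERNATIVE_NAMES.get(processor, processor)
    settings = {}
    for feature in features:
        name, sign = feature[:-1], feature[-1:]
        if sign not in ("+", "-") or name not in TARGET_ID_FEATURES:
            raise ValueError(f"malformed target feature: {feature!r}")
        if name in settings:
            raise ValueError(f"target feature given twice: {name!r}")
        settings[name] = sign
    # Canonical form lists the features in alphabetic order.
    ordered = [name + settings[name] for name in sorted(settings)]
    return ":".join([processor] + ordered)


print(canonicalize_target_id("gfx908:xnack-:sramecc+"))
```

A feature that is left out of the target ID keeps the ``any`` value, so the sketch never adds features that the input did not mention.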
562 .. _amdgpu-target-id-v2-v3:
564 Code Object V2 to V3 Target ID
565 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
567 The target ID syntax for code object V2 to V3 is the same as defined in `Clang
568 Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
569 when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
570 directive and the bundle entry ID. In those cases it has the following BNF
575 <target-id> ::== <processor> ( "+" <target-feature> )*
577 Where a target feature is omitted if *Off* and present if *On* or *Any*.
The code object V2 to V3 cannot represent *Any* and treats it the same as
*On*.
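As a rough illustration of the BNF above, the Python sketch below builds the code object V2 to V3 form from a processor name and a mapping of feature settings. The function name ``v2_v3_target_id`` and the ``"on"``/``"off"``/``"any"`` spellings are hypothetical, and emitting the features in alphabetic order is an assumption, since the BNF does not state an order.

```python
def v2_v3_target_id(processor, features):
    """Hypothetical helper: features maps a name to "on", "off", or "any"."""
    parts = [processor]
    for name in sorted(features):  # assumption: alphabetic order
        setting = features[name]
        if setting not in ("on", "off", "any"):
            raise ValueError(f"bad setting for {name!r}: {setting!r}")
        # A feature is omitted if Off and present if On or Any.
        if setting != "off":
            parts.append(name)
    return "+".join(parts)


print(v2_v3_target_id("gfx900", {"xnack": "any"}))
```

Because *On* and *Any* produce the same text, the original setting cannot be recovered from the resulting string.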
584 .. _amdgpu-embedding-bundled-objects:
586 Embedding Bundled Code Objects
587 ------------------------------
589 AMDGPU supports the HIP and OpenMP languages that perform code object embedding
590 as described in `Clang Offload Bundler
591 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.
595 The target ID syntax used for code object V2 to V3 for a bundle entry ID
596 differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
598 .. _amdgpu-address-spaces:
603 The AMDGPU architecture supports a number of memory address spaces. The address
604 space names use the OpenCL standard names, with some additions.
606 The AMDGPU address spaces correspond to target architecture specific LLVM
607 address space numbers used in LLVM IR.
609 The AMDGPU address spaces are described in
610 :ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
611 supported for the ``amdgcn`` target.
.. table:: AMDGPU Address Spaces
   :name: amdgpu-address-spaces-table

   ================================= =============== =========== ================ ======= ============================
   ..                                                                             64-Bit Process Address Space
   --------------------------------- --------------- ----------- ---------------- ------------------------------------
   Address Space Name                LLVM IR Address HSA Segment Hardware         Address NULL Value
                                     Space Number    Name        Name             Size
   ================================= =============== =========== ================ ======= ============================
   Generic                           0               flat        flat             64      0x0000000000000000
   Global                            1               global      global           64      0x0000000000000000
   Region                            2               N/A         GDS              32      *not implemented for AMDHSA*
   Local                             3               group       LDS              32      0xFFFFFFFF
   Constant                          4               constant    *same as global* 64      0x0000000000000000
   Private                           5               private     scratch          32      0xFFFFFFFF
   Constant 32-bit                   6               *TODO*                               0x00000000
   Buffer Fat Pointer (experimental) 7               *TODO*
   ================================= =============== =========== ================ ======= ============================
633 The generic address space is supported unless the *Target Properties* column
634 of :ref:`amdgpu-processor-table` specifies *Does not support generic address
The generic address space uses the hardware flat address support for two fixed
ranges of virtual addresses (the private and local apertures), that are
outside the range of addressable global memory, to map from a flat address to
a private or local address. This uses FLAT instructions that can take a flat
address and access global, private (scratch), and group (LDS) memory depending
on whether the address is within one of the aperture ranges.
644 Flat access to scratch requires hardware aperture setup and setup in the
645 kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
646 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
647 setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
To convert between a private or group address space address (termed a segment
address) and a flat address, the base address of the corresponding aperture
can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
For GFX9-GFX10 the aperture base addresses are directly available as inline
constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
aligned to 2^32, which makes it easier to convert from flat to segment or
segment to flat.
660 A global address space address has the same value when used as a flat address
661 so no conversion is needed.
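Because each aperture is 2^32 bytes in size and its base is aligned to 2^32, the conversion reduces to setting or clearing the upper 32 bits. The Python sketch below illustrates that arithmetic only. The function names and the example base value are hypothetical, and the sketch ignores NULL pointer handling, which a real conversion must treat specially because the segment and flat NULL values differ.

```python
APERTURE_SIZE = 1 << 32  # aperture size in 64-bit address mode


def segment_to_flat(aperture_base, segment_address):
    """Map a 32-bit private or local address into its aperture."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    # The low 32 bits of an aligned base are zero, so OR is enough.
    return aperture_base | segment_address


def flat_to_segment(aperture_base, flat_address):
    """Recover the 32-bit segment address from a flat address."""
    offset = flat_address - aperture_base
    if not 0 <= offset < APERTURE_SIZE:
        raise ValueError("flat address is outside the aperture")
    return offset


base = 0x2000_0000_0000  # hypothetical aperture base, aligned to 2^32
flat = segment_to_flat(base, 0x10)
print(hex(flat), hex(flat_to_segment(base, flat)))
```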
663 **Global and Constant**
664 The global and constant address spaces both use global virtual addresses,
665 which are the same virtual address space used by the CPU. However, some
666 virtual addresses may only be accessible to the CPU, some only accessible
667 by the GPU, and some by both.
Using the constant address space indicates that the data will not change
during the execution of the kernel. This allows scalar read instructions to
be used. Because the constant address space can only be modified on the host
side, a generic pointer loaded from the constant address space can safely be
assumed to be a global pointer, since only device global memory is visible to
and managed by the host. The vector and scalar L1 caches are invalidated
of volatile data before each kernel dispatch execution to allow constant
memory to change values between kernel dispatches.
679 The region address space uses the hardware Global Data Store (GDS). All
680 wavefronts executing on the same device will access the same memory for any
681 given region address. However, the same region address accessed by wavefronts
682 executing on different devices will access different memory. It is higher
683 performance than global memory. It is allocated by the runtime. The data
684 store (DS) instructions can be used to access it.
687 The local address space uses the hardware Local Data Store (LDS) which is
688 automatically allocated when the hardware creates the wavefronts of a
689 work-group, and freed when all the wavefronts of a work-group have
690 terminated. All wavefronts belonging to the same work-group will access the
691 same memory for any given local address. However, the same local address
692 accessed by wavefronts belonging to different work-groups will access
693 different memory. It is higher performance than global memory. The data store
694 (DS) instructions can be used to access it.
The private address space uses the hardware scratch memory support which
automatically allocates memory when it creates a wavefront and frees it when
a wavefront terminates. The memory accessed by a lane of a wavefront for any
given private address will be different to the memory accessed by another lane
of the same or different wavefront for the same private address.
703 If a kernel dispatch uses scratch, then the hardware allocates memory from a
704 pool of backing memory allocated by the runtime for each wavefront. The lanes
705 of the wavefront access this using dword (4 byte) interleaving. The mapping
706 used from private address to backing memory address is:
708 ``wavefront-scratch-base +
709 ((private-address / 4) * wavefront-size * 4) +
710 (wavefront-lane-id * 4) + (private-address % 4)``
712 If each lane of a wavefront accesses the same private address, the
713 interleaving results in adjacent dwords being accessed and hence requires
714 fewer cache lines to be fetched.
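The mapping above can be checked with a few lines of Python. This is a direct transcription of the formula for illustration. The function name and the default wavefront size of 64 are assumptions made for the example.

```python
def scratch_backing_address(wavefront_scratch_base, private_address,
                            wavefront_lane_id, wavefront_size=64):
    """Backing memory address for one lane's private address."""
    dword_index = private_address // 4   # which dword of private memory
    byte_in_dword = private_address % 4  # byte within that dword
    return (wavefront_scratch_base
            + dword_index * wavefront_size * 4
            + wavefront_lane_id * 4
            + byte_in_dword)


# Four lanes reading the same private address touch adjacent dwords.
print([scratch_backing_address(0, 8, lane) for lane in range(4)])
```

With a wavefront size of 64, private address 8 is dword 2, so the four lanes map to backing offsets 512, 516, 520, and 524 from the wavefront scratch base.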
716 There are different ways that the wavefront scratch base address is
717 determined by a wavefront (see
718 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
720 Scratch memory can be accessed in an interleaved manner using buffer
721 instructions with the scratch buffer descriptor and per wavefront scratch
722 offset, by the scratch instructions, or by flat instructions. Multi-dword
723 access is not supported except by flat and scratch instructions in
729 **Buffer Fat Pointer**
The buffer fat pointer is an experimental address space that is currently
unsupported in the backend. It exposes a non-integral pointer that is
intended, in the future, to support modelling 128-bit buffer descriptors
plus a 32-bit offset into the buffer (a 160-bit *pointer* in total),
allowing normal LLVM load/store/atomic operations to be used to
735 model the buffer descriptors used heavily in graphics workloads targeting
738 .. _amdgpu-memory-scopes:
This section describes the LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
747 The memory model supported is based on the HSA memory model [HSA]_ which is
748 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
749 relation is transitive over the synchronizes-with relation independent of scope
750 and synchronizes-with allows the memory scope instances to be inclusive (see
751 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
753 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
754 inclusion and requires the memory scopes to exactly match. However, this
755 is conservatively correct for OpenCL.
757 .. table:: AMDHSA LLVM Sync Scopes
758 :name: amdgpu-amdhsa-llvm-sync-scopes-table
760 ======================= ===================================================
761 LLVM Sync Scope Description
762 ======================= ===================================================
763 *none* The default: ``system``.
765 Synchronizes with, and participates in modification
766 and seq_cst total orderings with, other operations
767 (except image operations) for all address spaces
768 (except private, or generic that accesses private)
769 provided the other operation's sync scope is:
772 - ``agent`` and executed by a thread on the same
774 - ``workgroup`` and executed by a thread in the
776 - ``wavefront`` and executed by a thread in the
779 ``agent`` Synchronizes with, and participates in modification
780 and seq_cst total orderings with, other operations
781 (except image operations) for all address spaces
782 (except private, or generic that accesses private)
783 provided the other operation's sync scope is:
785 - ``system`` or ``agent`` and executed by a thread
787 - ``workgroup`` and executed by a thread in the
789 - ``wavefront`` and executed by a thread in the
792 ``workgroup`` Synchronizes with, and participates in modification
793 and seq_cst total orderings with, other operations
794 (except image operations) for all address spaces
795 (except private, or generic that accesses private)
796 provided the other operation's sync scope is:
798 - ``system``, ``agent`` or ``workgroup`` and
799 executed by a thread in the same work-group.
800 - ``wavefront`` and executed by a thread in the
803 ``wavefront`` Synchronizes with, and participates in modification
804 and seq_cst total orderings with, other operations
805 (except image operations) for all address spaces
806 (except private, or generic that accesses private)
807 provided the other operation's sync scope is:
809 - ``system``, ``agent``, ``workgroup`` or
810 ``wavefront`` and executed by a thread in the
813 ``singlethread`` Only synchronizes with and participates in
814 modification and seq_cst total orderings with,
815 other operations (except image operations) running
816 in the same thread for all address spaces (for
817 example, in signal handlers).
819 ``one-as`` Same as ``system`` but only synchronizes with other
820 operations within the same address space.
822 ``agent-one-as`` Same as ``agent`` but only synchronizes with other
823 operations within the same address space.
825 ``workgroup-one-as`` Same as ``workgroup`` but only synchronizes with
826 other operations within the same address space.
828 ``wavefront-one-as`` Same as ``wavefront`` but only synchronizes with
829 other operations within the same address space.
831 ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
832 other operations within the same address space.
833 ======================= ===================================================
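The scope inclusion rule in the table above can be summarized as follows: two operations can synchronize when both threads lie inside the instance of the narrower of the two scopes. The Python sketch below models only that rule. The ``Location`` tuple and the function name are hypothetical, and the sketch ignores the ``one-as`` variants as well as the address space and image operation exceptions listed in the table.

```python
from collections import namedtuple

# Hypothetical description of where a thread runs; IDs are hierarchical.
Location = namedtuple("Location", ["agent", "workgroup", "wavefront", "thread"])

# Number of leading Location fields that must match for each scope.
REQUIRED_SHARED_LEVELS = {
    "system": 0,
    "agent": 1,
    "workgroup": 2,
    "wavefront": 3,
    "singlethread": 4,
}


def scopes_are_inclusive(scope_a, loc_a, scope_b, loc_b):
    """True if each thread is inside the other operation's scope instance."""
    # The narrower scope requires more fields to match.
    levels = max(REQUIRED_SHARED_LEVELS[scope_a], REQUIRED_SHARED_LEVELS[scope_b])
    return loc_a[:levels] == loc_b[:levels]


a = Location(agent=0, workgroup=0, wavefront=0, thread=0)
b = Location(agent=0, workgroup=0, wavefront=1, thread=5)
print(scopes_are_inclusive("agent", a, "workgroup", b))
```

In this example an ``agent`` scope operation and a ``workgroup`` scope operation run in the same work-group, so their scopes are inclusive even though they do not match exactly.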
838 The AMDGPU backend implements the following LLVM IR intrinsics.
840 *This section is WIP.*
844 List AMDGPU intrinsics.
849 The AMDGPU backend supports the following LLVM IR attributes.
851 .. table:: AMDGPU LLVM IR Attributes
852 :name: amdgpu-llvm-ir-attributes-table
854 ======================================= ==========================================================
855 LLVM Attribute Description
856 ======================================= ==========================================================
857 "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
858 will be specified when the kernel is dispatched. Generated
859 by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
860 The implied default value is 1,1024.
862 "amdgpu-implicitarg-num-bytes"="n" Number of kernel argument bytes to add to the kernel
863 argument block size for the implicit arguments. This
864 varies by OS and language (for OpenCL see
865 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
866 "amdgpu-num-sgpr"="n" Specifies the number of SGPRs to use. Generated by
867 the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
868 "amdgpu-num-vgpr"="n" Specifies the number of VGPRs to use. Generated by the
869 ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
870 "amdgpu-waves-per-eu"="m,n" Specify the minimum and maximum number of waves per
871 execution unit. Generated by the ``amdgpu_waves_per_eu``
872 CLANG attribute [CLANG-ATTR]_. This is an optimization hint,
873 and the backend may not be able to satisfy the request. If
874 the specified range is incompatible with the function's
875 "amdgpu-flat-work-group-size" value, the implied occupancy
876 bounds by the workgroup size takes precedence.
878 "amdgpu-ieee" true/false. Specify whether the function expects the IEEE field of the
879 mode register to be set on entry. Overrides the default for
880 the calling convention.
881 "amdgpu-dx10-clamp" true/false. Specify whether the function expects the DX10_CLAMP field of
882 the mode register to be set on entry. Overrides the default
883 for the calling convention.
885 "amdgpu-no-workitem-id-x" Indicates the function does not depend on the value of the
886 llvm.amdgcn.workitem.id.x intrinsic. If a function is marked with this
887 attribute, or reached through a call site marked with this attribute,
888 the value returned by the intrinsic is undefined. The backend can
889 generally infer this during code generation, so typically there is no
890 benefit to frontends marking functions with this.
892 "amdgpu-no-workitem-id-y" The same as amdgpu-no-workitem-id-x, except for the
893 llvm.amdgcn.workitem.id.y intrinsic.
895 "amdgpu-no-workitem-id-z" The same as amdgpu-no-workitem-id-x, except for the
896 llvm.amdgcn.workitem.id.z intrinsic.
898 "amdgpu-no-workgroup-id-x" The same as amdgpu-no-workitem-id-x, except for the
899 llvm.amdgcn.workgroup.id.x intrinsic.
901 "amdgpu-no-workgroup-id-y" The same as amdgpu-no-workitem-id-x, except for the
902 llvm.amdgcn.workgroup.id.y intrinsic.
904 "amdgpu-no-workgroup-id-z" The same as amdgpu-no-workitem-id-x, except for the
905 llvm.amdgcn.workgroup.id.z intrinsic.
907 "amdgpu-no-dispatch-ptr" The same as amdgpu-no-workitem-id-x, except for the
908 llvm.amdgcn.dispatch.ptr intrinsic.
910 "amdgpu-no-implicitarg-ptr" The same as amdgpu-no-workitem-id-x, except for the
911 llvm.amdgcn.implicitarg.ptr intrinsic.
913 "amdgpu-no-dispatch-id" The same as amdgpu-no-workitem-id-x, except for the
914 llvm.amdgcn.dispatch.id intrinsic.
916 "amdgpu-no-queue-ptr" Similar to amdgpu-no-workitem-id-x, except for the
917 llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
918 attributes, the queue pointer may be required in situations where the
919 intrinsic call does not directly appear in the program. Some subtargets
920 require the queue pointer to handle some addrspacecasts, as well
921 as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
922 llvm.debug intrinsics.
924 ======================================= ==========================================================
926 .. _amdgpu-elf-code-object:
931 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
932 can be linked by ``lld`` to produce a standard ELF shared code object which can
933 be loaded and executed on an AMDGPU target.
935 .. _amdgpu-elf-header:
940 The AMDGPU backend uses the following ELF header:
942 .. table:: AMDGPU ELF Header
943 :name: amdgpu-elf-header-table
945 ========================== ===============================
947 ========================== ===============================
948 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
949 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
950 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
951 - ``ELFOSABI_AMDGPU_HSA``
952 - ``ELFOSABI_AMDGPU_PAL``
953 - ``ELFOSABI_AMDGPU_MESA3D``
954 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
955 - ``ELFABIVERSION_AMDGPU_HSA_V3``
956 - ``ELFABIVERSION_AMDGPU_HSA_V4``
957 - ``ELFABIVERSION_AMDGPU_PAL``
958 - ``ELFABIVERSION_AMDGPU_MESA3D``
959 ``e_type`` - ``ET_REL``
961 ``e_machine`` ``EM_AMDGPU``
963 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-v2-table`,
964 :ref:`amdgpu-elf-header-e_flags-table-v3`,
965 and :ref:`amdgpu-elf-header-e_flags-table-v4`
966 ========================== ===============================
970 .. table:: AMDGPU ELF Header Enumeration Values
971 :name: amdgpu-elf-header-enumeration-values-table
973 =============================== =====
975 =============================== =====
978 ``ELFOSABI_AMDGPU_HSA`` 64
979 ``ELFOSABI_AMDGPU_PAL`` 65
980 ``ELFOSABI_AMDGPU_MESA3D`` 66
981 ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
982 ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
983 ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
984 ``ELFABIVERSION_AMDGPU_PAL`` 0
985 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
986 =============================== =====
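As an illustration of how these identification bytes might be read back, the following is a minimal Python sketch rather than the loader's actual logic. The ``e_ident`` indices and the ``ELFCLASS*``/``ELFDATA*`` values are the standard ELF ones; the AMDGPU values come from the table above. The function name ``describe_e_ident`` is hypothetical.

```python
# Standard ELF e_ident byte indices.
EI_CLASS, EI_DATA, EI_OSABI, EI_ABIVERSION = 4, 5, 7, 8

ELF_CLASSES = {1: "ELFCLASS32", 2: "ELFCLASS64"}
ELF_DATA = {1: "ELFDATA2LSB", 2: "ELFDATA2MSB"}

# ELFOSABI_NONE is the standard ELF value 0; the rest are from the table above.
OSABI = {
    0: "ELFOSABI_NONE",
    64: "ELFOSABI_AMDGPU_HSA",
    65: "ELFOSABI_AMDGPU_PAL",
    66: "ELFOSABI_AMDGPU_MESA3D",
}

# The ABI version is only meaningful relative to the OS ABI, so key on both.
ABIVERSION = {
    (64, 0): "ELFABIVERSION_AMDGPU_HSA_V2",
    (64, 1): "ELFABIVERSION_AMDGPU_HSA_V3",
    (64, 2): "ELFABIVERSION_AMDGPU_HSA_V4",
    (65, 0): "ELFABIVERSION_AMDGPU_PAL",
    (66, 0): "ELFABIVERSION_AMDGPU_MESA3D",
}


def describe_e_ident(e_ident: bytes) -> dict:
    """Hypothetical helper: name the fields of a 16-byte e_ident array."""
    if len(e_ident) < 16 or e_ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF identification block")
    osabi = e_ident[EI_OSABI]
    return {
        "class": ELF_CLASSES.get(e_ident[EI_CLASS], "unknown"),
        "data": ELF_DATA.get(e_ident[EI_DATA], "unknown"),
        "osabi": OSABI.get(osabi, "unknown"),
        "abiversion": ABIVERSION.get((osabi, e_ident[EI_ABIVERSION]), "unknown"),
    }
```

Keying the ABI version on the ``(osabi, abiversion)`` pair reflects that the value 0 means something different for HSA, PAL, and MESA3D.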
988 ``e_ident[EI_CLASS]``
991 * ``ELFCLASS32`` for ``r600`` architecture.
993 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
994 process address space applications.
997 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
999 ``e_ident[EI_OSABI]``
1000 One of the following AMDGPU target architecture specific OS ABIs
1001 (see :ref:`amdgpu-os`):
1003 * ``ELFOSABI_NONE`` for *unknown* OS.
1005 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
1007 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
1009 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
1011 ``e_ident[EI_ABIVERSION]``
1012 The ABI version of the AMDGPU target architecture specific OS ABI to which the code
1015 * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
1016 runtime ABI for code object V2. Specify using the Clang option
1017 ``-mcode-object-version=2``.
1019 * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
1020 runtime ABI for code object V3. Specify using the Clang option
1021 ``-mcode-object-version=3``. This is the default code object
1022 version if not specified.
1024 * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
1025 runtime ABI for code object V4. Specify using the Clang option
1026 ``-mcode-object-version=4``.
1028 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
1031 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
1035 Can be one of the following values:
1039 The type produced by the AMDGPU backend compiler as it is relocatable code
1043 The type produced by the linker as it is a shared code object.
1045 The AMD HSA runtime loader requires an ``ET_DYN`` code object.
1048 The value ``EM_AMDGPU`` is used for the machine for all processors supported
1049 by the ``r600`` and ``amdgcn`` architectures (see
1050 :ref:`amdgpu-processor-table`). The specific processor is specified in the
1051 ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
1052 :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
1053 ``e_flags`` for code object V3 to V4 (see
1054 :ref:`amdgpu-elf-header-e_flags-table-v3` and
1055 :ref:`amdgpu-elf-header-e_flags-table-v4`).
1058 The entry point is 0 as the entry points for individual kernels must be
1059 selected in order to invoke them through AQL packets.
1062 The AMDGPU backend uses the following ELF header flags:
1064 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
1065 :name: amdgpu-elf-header-e_flags-v2-table
1067 ===================================== ===== =============================
1068 Name Value Description
1069 ===================================== ===== =============================
1070 ``EF_AMDGPU_FEATURE_XNACK_V2`` 0x01 Indicates if the ``xnack``
1072 enabled for all code
1073 contained in the code object.
1075 does not support the
1080 :ref:`amdgpu-target-features`.
1081 ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02 Indicates if the trap
1082 handler is enabled for all
1083 code contained in the code
1084 object. If the processor
1085 does not support a trap
1086 handler then must be 0.
1088 :ref:`amdgpu-target-features`.
1089 ===================================== ===== =============================
1091 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
1092 :name: amdgpu-elf-header-e_flags-table-v3
1094 ================================= ===== =============================
1095 Name Value Description
1096 ================================= ===== =============================
1097 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection
1099 ``EF_AMDGPU_MACH_xxx`` values
1101 :ref:`amdgpu-ef-amdgpu-mach-table`.
1102 ``EF_AMDGPU_FEATURE_XNACK_V3`` 0x100 Indicates if the ``xnack``
1104 enabled for all code
1105 contained in the code object.
1107 does not support the
1112 :ref:`amdgpu-target-features`.
1113 ``EF_AMDGPU_FEATURE_SRAMECC_V3`` 0x200 Indicates if the ``sramecc``
1115 enabled for all code
1116 contained in the code object.
1118 does not support the
1123 :ref:`amdgpu-target-features`.
1124 ================================= ===== =============================
1126 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4
1127 :name: amdgpu-elf-header-e_flags-table-v4
1129 ============================================ ===== ===================================
1130 Name Value Description
1131 ============================================ ===== ===================================
1132 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection
1134 ``EF_AMDGPU_MACH_xxx`` values
1136 :ref:`amdgpu-ef-amdgpu-mach-table`.
1137 ``EF_AMDGPU_FEATURE_XNACK_V4`` 0x300 XNACK selection mask for
1138 ``EF_AMDGPU_FEATURE_XNACK_*_V4``
1140 ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4`` 0x000 XNACK unsupported.
1141 ``EF_AMDGPU_FEATURE_XNACK_ANY_V4`` 0x100 XNACK can have any value.
1142 ``EF_AMDGPU_FEATURE_XNACK_OFF_V4`` 0x200 XNACK disabled.
1143 ``EF_AMDGPU_FEATURE_XNACK_ON_V4`` 0x300 XNACK enabled.
1144 ``EF_AMDGPU_FEATURE_SRAMECC_V4`` 0xc00 SRAMECC selection mask for
1145 ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
1147 ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
1148 ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4`` 0x400 SRAMECC can have any value.
1149 ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4`` 0x800 SRAMECC disabled.
1150 ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4`` 0xc00 SRAMECC enabled.
1151 ============================================ ===== ===================================
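The code object V4 ``e_flags`` word packs three independent bit fields, so decoding it is a matter of masking. The following Python sketch shows one way to do that using the mask and selection values from the table above. The function name ``decode_e_flags_v4`` is hypothetical, and ``MACH_NAMES`` lists only a few entries of the ``EF_AMDGPU_MACH`` values table for brevity.

```python
# Field masks for code object V4 e_flags (values from the table above).
EF_AMDGPU_MACH = 0x0FF
EF_AMDGPU_FEATURE_XNACK_V4 = 0x300
EF_AMDGPU_FEATURE_SRAMECC_V4 = 0xC00

XNACK_V4 = {0x000: "unsupported", 0x100: "any", 0x200: "off", 0x300: "on"}
SRAMECC_V4 = {0x000: "unsupported", 0x400: "any", 0x800: "off", 0xC00: "on"}

# Illustrative subset of the EF_AMDGPU_MACH values table.
MACH_NAMES = {0x02F: "gfx906", 0x030: "gfx908", 0x036: "gfx1030", 0x03F: "gfx90a"}


def decode_e_flags_v4(e_flags: int) -> dict:
    """Hypothetical helper: split a V4 e_flags word into its three fields."""
    mach = e_flags & EF_AMDGPU_MACH
    return {
        "processor": MACH_NAMES.get(mach, f"unknown (0x{mach:03x})"),
        "xnack": XNACK_V4[e_flags & EF_AMDGPU_FEATURE_XNACK_V4],
        "sramecc": SRAMECC_V4[e_flags & EF_AMDGPU_FEATURE_SRAMECC_V4],
    }
```

For example, ``decode_e_flags_v4(0x73F)`` reports ``gfx90a`` with XNACK enabled and SRAMECC set to any value, since ``0x73F`` is ``0x03F | 0x300 | 0x400``.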
1153 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
1154 :name: amdgpu-ef-amdgpu-mach-table
1156 ==================================== ========== =============================
1157 Name Value Description (see
1158 :ref:`amdgpu-processor-table`)
1159 ==================================== ========== =============================
1160 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
1161 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
1162 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
1163 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
1164 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
1165 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
1166 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
1167 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
1168 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
1169 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
1170 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
1171 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
1172 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
1173 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
1174 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
1175 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
1176 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
1177 *reserved* 0x011 - Reserved for ``r600``
1178 0x01f architecture processors.
1179 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
1180 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
1181 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
1182 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
1183 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
1184 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
1185 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
1186 *reserved* 0x027 Reserved.
1187 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
1188 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
1189 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
1190 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
1191 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
1192 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
1193 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
1194 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
1195 ``EF_AMDGPU_MACH_AMDGCN_GFX908`` 0x030 ``gfx908``
1196 ``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909``
1197 ``EF_AMDGPU_MACH_AMDGCN_GFX90C`` 0x032 ``gfx90c``
1198 ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033 ``gfx1010``
1199 ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034 ``gfx1011``
1200 ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035 ``gfx1012``
1201 ``EF_AMDGPU_MACH_AMDGCN_GFX1030`` 0x036 ``gfx1030``
1202 ``EF_AMDGPU_MACH_AMDGCN_GFX1031`` 0x037 ``gfx1031``
1203 ``EF_AMDGPU_MACH_AMDGCN_GFX1032`` 0x038 ``gfx1032``
1204 ``EF_AMDGPU_MACH_AMDGCN_GFX1033`` 0x039 ``gfx1033``
1205 ``EF_AMDGPU_MACH_AMDGCN_GFX602`` 0x03a ``gfx602``
1206 ``EF_AMDGPU_MACH_AMDGCN_GFX705`` 0x03b ``gfx705``
1207 ``EF_AMDGPU_MACH_AMDGCN_GFX805`` 0x03c ``gfx805``
1208 ``EF_AMDGPU_MACH_AMDGCN_GFX1035`` 0x03d ``gfx1035``
1209 ``EF_AMDGPU_MACH_AMDGCN_GFX1034`` 0x03e ``gfx1034``
1210 ``EF_AMDGPU_MACH_AMDGCN_GFX90A`` 0x03f ``gfx90a``
1211 *reserved* 0x040 Reserved.
1212 *reserved* 0x041 Reserved.
1213 ``EF_AMDGPU_MACH_AMDGCN_GFX1013`` 0x042 ``gfx1013``
1214 *reserved* 0x043 Reserved.
1215 *reserved* 0x044 Reserved.
1216 *reserved* 0x045 Reserved.
1217 ==================================== ========== =============================
1222 An AMDGPU target ELF code object has the standard ELF sections which include:
1224 .. table:: AMDGPU ELF Sections
1225 :name: amdgpu-elf-sections-table
1227 ================== ================ =================================
1228 Name Type Attributes
1229 ================== ================ =================================
1230 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1231 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1232 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
1233 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
1234 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1235 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1236 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1237 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
1238 ``.note`` ``SHT_NOTE`` *none*
1239 ``.rela``\ *name* ``SHT_RELA`` *none*
1240 ``.rela.dyn`` ``SHT_RELA`` *none*
1241 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1242 ``.shstrtab`` ``SHT_STRTAB`` *none*
1243 ``.strtab`` ``SHT_STRTAB`` *none*
1244 ``.symtab`` ``SHT_SYMTAB`` *none*
1245 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
1246 ================== ================ =================================
1248 These sections have their standard meanings (see [ELF]_) and are only generated
1252 The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
1253 information on the DWARF produced by the AMDGPU backend.
1255 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
1256 The standard sections used by a dynamic loader.
1259 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
1262 ``.rela``\ *name*, ``.rela.dyn``
1263 For relocatable code objects, *name* is the name of the section that the
1264 relocation records apply. For example, ``.rela.text`` is the section name for
1265 relocation records associated with the ``.text`` section.
1267 For linked shared code objects, ``.rela.dyn`` contains all the relocation
1268 records from each of the relocatable code object's ``.rela``\ *name* sections.
1270 See :ref:`amdgpu-relocation-records` for the relocation records supported by
1274 The executable machine code for the kernels and functions they call. Generated
1275 as position independent code. See :ref:`amdgpu-code-conventions` for
1276 information on conventions used in the ISA generation.
1278 .. _amdgpu-note-records:
1283 The AMDGPU backend code object contains ELF note records in the ``.note``
1284 section. The set of generated notes and their semantics depend on the code
1285 object version; see :ref:`amdgpu-note-records-v2` and
1286 :ref:`amdgpu-note-records-v3-v4`.
1288 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
1289 must be generated after the ``name`` field to ensure the ``desc`` field is 4
1290 byte aligned. In addition, minimal zero-byte padding must be generated to
1291 ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
1292 field of the ``.note`` section must be at least 4 to indicate at least 4 byte alignment.
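The padding rules can be illustrated with a small Python sketch that builds and walks note records. It assumes the standard ELF note layout (``namesz``, ``descsz``, and ``type`` as little-endian 32-bit words, followed by the padded ``name`` and ``desc`` fields). The helper names ``build_note`` and ``iter_notes`` are hypothetical and not part of any LLVM API.

```python
import struct


def align4(n: int) -> int:
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def build_note(name: str, ntype: int, desc: bytes) -> bytes:
    """Encode one note record with minimal zero-byte padding."""
    raw_name = name.encode("ascii") + b"\0"  # namesz counts the NUL
    header = struct.pack("<III", len(raw_name), len(desc), ntype)
    return (header
            + raw_name.ljust(align4(len(raw_name)), b"\0")
            + desc.ljust(align4(len(desc)), b"\0"))


def iter_notes(data: bytes):
    """Yield (name, type, desc) for each record in a .note section image."""
    offset = 0
    while offset + 12 <= len(data):
        namesz, descsz, ntype = struct.unpack_from("<III", data, offset)
        offset += 12
        name = data[offset:offset + namesz].rstrip(b"\0").decode("ascii")
        offset += align4(namesz)  # skip padding so desc is 4 byte aligned
        desc = data[offset:offset + descsz]
        offset += align4(descsz)  # skip padding after desc
        yield name, ntype, desc
```

Note that ``descsz`` records the unpadded size, so a reader must round it up itself to find the next record.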
1295 .. _amdgpu-note-records-v2:
1297 Code Object V2 Note Records
1298 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
1301 Code object V2 is not the default code object version emitted by
1302 this version of LLVM.
1304 The AMDGPU backend code object uses the following ELF note record in the
1305 ``.note`` section when compiling for code object V2.
1307 The note record vendor field is "AMD".
1309 Additional note records may be present, but any which are not documented here
1310 are deprecated and should not be used.
1312 .. table:: AMDGPU Code Object V2 ELF Note Records
1313 :name: amdgpu-elf-note-records-v2-table
1315 ===== ===================================== ======================================
1316 Name Type Description
1317 ===== ===================================== ======================================
1318 "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION`` Code object version.
1319 "AMD" ``NT_AMD_HSA_HSAIL`` HSAIL properties generated by the HSAIL
1320 Finalizer and not the LLVM compiler.
1321 "AMD" ``NT_AMD_HSA_ISA_VERSION`` Target ISA version.
1322 "AMD" ``NT_AMD_HSA_METADATA`` Metadata null terminated string in
1323 YAML [YAML]_ textual format.
1324 "AMD" ``NT_AMD_HSA_ISA_NAME`` Target ISA name.
1325 ===== ===================================== ======================================
1329 .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
1330 :name: amdgpu-elf-note-record-enumeration-values-v2-table
1332 ===================================== =====
1334 ===================================== =====
1335 ``NT_AMD_HSA_CODE_OBJECT_VERSION`` 1
1336 ``NT_AMD_HSA_HSAIL`` 2
1337 ``NT_AMD_HSA_ISA_VERSION`` 3
1339 ``NT_AMD_HSA_METADATA`` 10
1340 ``NT_AMD_HSA_ISA_NAME`` 11
1341 ===================================== =====
1343 ``NT_AMD_HSA_CODE_OBJECT_VERSION``
1344 Specifies the code object version number. The description field has the
1349 struct amdgpu_hsa_note_code_object_version_s {
1350 uint32_t major_version;
1351 uint32_t minor_version;
1354 The ``major_version`` has a value less than or equal to 2.
1356 ``NT_AMD_HSA_HSAIL``
1357 Specifies the HSAIL properties used by the HSAIL Finalizer. The description
1358 field has the following layout:
1362 struct amdgpu_hsa_note_hsail_s {
1363 uint32_t hsail_major_version;
1364 uint32_t hsail_minor_version;
1366 uint8_t machine_model;
1367 uint8_t default_float_round;
1370 ``NT_AMD_HSA_ISA_VERSION``
1371 Specifies the target ISA version. The description field has the following layout:
1375 struct amdgpu_hsa_note_isa_s {
1376 uint16_t vendor_name_size;
1377 uint16_t architecture_name_size;
1381 char vendor_and_architecture_name[1];
1384 ``vendor_name_size`` and ``architecture_name_size`` are the lengths of the
1385 vendor and architecture names respectively, including the NUL character.
1387 ``vendor_and_architecture_name`` contains the NUL terminated string for the
1388 vendor, immediately followed by the NUL terminated string for the
1391 This note record is used by the HSA runtime loader.
1393 Code object V2 only supports a limited number of processors and has fixed
1394 settings for target features. See
1395 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
1396 processors and the corresponding target ID. In the table the note record ISA
1397 name is a concatenation of the vendor name, architecture name, major, minor,
1398 and stepping separated by a ":".
1400 The target ID column shows the processor name and fixed target features used
1401 by the LLVM compiler. The LLVM compiler does not generate a
1402 ``NT_AMD_HSA_HSAIL`` note record.
1404 A code object generated by the Finalizer also uses code object V2 and always
1405 generates a ``NT_AMD_HSA_HSAIL`` note record. The processor name and
1406 ``sramecc`` target feature are as shown in
1407 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
1408 target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
1411 ``NT_AMD_HSA_ISA_NAME``
1412 Specifies the target ISA name as a non-NUL terminated string.
1414 This note record is not used by the HSA runtime loader.
1416 See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object
1417 V2's limited support of processors and fixed settings for target features.
1419 See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
1420 from the string to the corresponding target ID. If the ``xnack`` target
1421 feature is supported and enabled, the string produced by the LLVM compiler
1422 may have ``+xnack`` appended. The Finalizer did not do the appending and
1423 instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
1425 ``NT_AMD_HSA_METADATA``
1426 Specifies extensible metadata associated with the code objects executed on HSA
1427 [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
1428 target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
1429 :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
1432 .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
1433 :name: amdgpu-elf-note-record-supported_processors-v2-table
1435 ===================== ==========================
1436 Note Record ISA Name Target ID
1437 ===================== ==========================
1438 ``AMD:AMDGPU:6:0:0`` ``gfx600``
1439 ``AMD:AMDGPU:6:0:1`` ``gfx601``
1440 ``AMD:AMDGPU:6:0:2`` ``gfx602``
1441 ``AMD:AMDGPU:7:0:0`` ``gfx700``
1442 ``AMD:AMDGPU:7:0:1`` ``gfx701``
1443 ``AMD:AMDGPU:7:0:2`` ``gfx702``
1444 ``AMD:AMDGPU:7:0:3`` ``gfx703``
1445 ``AMD:AMDGPU:7:0:4`` ``gfx704``
1446 ``AMD:AMDGPU:7:0:5`` ``gfx705``
1447 ``AMD:AMDGPU:8:0:0`` ``gfx802``
1448 ``AMD:AMDGPU:8:0:1`` ``gfx801:xnack+``
1449 ``AMD:AMDGPU:8:0:2`` ``gfx802``
1450 ``AMD:AMDGPU:8:0:3`` ``gfx803``
1451 ``AMD:AMDGPU:8:0:4`` ``gfx803``
1452 ``AMD:AMDGPU:8:0:5`` ``gfx805``
1453 ``AMD:AMDGPU:8:1:0`` ``gfx810:xnack+``
1454 ``AMD:AMDGPU:9:0:0`` ``gfx900:xnack-``
1455 ``AMD:AMDGPU:9:0:1`` ``gfx900:xnack+``
1456 ``AMD:AMDGPU:9:0:2`` ``gfx902:xnack-``
1457 ``AMD:AMDGPU:9:0:3`` ``gfx902:xnack+``
1458 ``AMD:AMDGPU:9:0:4`` ``gfx904:xnack-``
1459 ``AMD:AMDGPU:9:0:5`` ``gfx904:xnack+``
1460 ``AMD:AMDGPU:9:0:6`` ``gfx906:sramecc-:xnack-``
1461 ``AMD:AMDGPU:9:0:7`` ``gfx906:sramecc-:xnack+``
1462 ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
1463 ===================== ==========================
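Because the note record ISA name is a ":"-separated concatenation, it can be split back into its components and mapped to a target ID. The following Python sketch assumes the five-part form shown in the table above. ``parse_isa_name`` is a hypothetical helper, and ``V2_TARGET_IDS`` copies only a few rows of the table.

```python
def parse_isa_name(isa_name: str) -> dict:
    """Split a V2 note record ISA name into its five components."""
    vendor, architecture, major, minor, stepping = isa_name.split(":")
    return {
        "vendor": vendor,
        "architecture": architecture,
        "major": int(major),
        "minor": int(minor),
        "stepping": int(stepping),
    }


# Illustrative subset of the supported processors table above.
V2_TARGET_IDS = {
    "AMD:AMDGPU:8:0:1": "gfx801:xnack+",
    "AMD:AMDGPU:9:0:0": "gfx900:xnack-",
    "AMD:AMDGPU:9:0:6": "gfx906:sramecc-:xnack-",
    "AMD:AMDGPU:9:0:12": "gfx90c:xnack-",
}
```

The stepping is parsed as a decimal number, which matters for ``gfx90c``: its stepping is 12, not a single hexadecimal digit.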
1465 .. _amdgpu-note-records-v3-v4:
1467 Code Object V3 to V4 Note Records
1468 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1470 The AMDGPU backend code object uses the following ELF note record in the
1471 ``.note`` section when compiling for code object V3 to V4.
1473 The note record vendor field is "AMDGPU".
1475 Additional note records may be present, but any which are not documented here
1476 are deprecated and should not be used.
1478 .. table:: AMDGPU Code Object V3 to V4 ELF Note Records
1479 :name: amdgpu-elf-note-records-table-v3-v4
1481 ======== ============================== ======================================
1482 Name Type Description
1483 ======== ============================== ======================================
1484 "AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_
1486 ======== ============================== ======================================
1490 .. table:: AMDGPU Code Object V3 to V4 ELF Note Record Enumeration Values
1491 :name: amdgpu-elf-note-record-enumeration-values-table-v3-v4
1493 ============================== =====
1495 ============================== =====
1497 ``NT_AMDGPU_METADATA`` 32
1498 ============================== =====
1500 ``NT_AMDGPU_METADATA``
1501 Specifies extensible metadata associated with an AMDGPU code object. It is
1502 encoded as a map in the Message Pack [MsgPack]_ binary data format. See
1503 :ref:`amdgpu-amdhsa-code-object-metadata-v3` and
1504 :ref:`amdgpu-amdhsa-code-object-metadata-v4` for the map keys defined for the
1512 Symbols include the following:
1514 .. table:: AMDGPU ELF Symbols
1515 :name: amdgpu-elf-symbols-table
1517 ===================== ================== ================ ==================
1518 Name Type Section Description
1519 ===================== ================== ================ ==================
1520 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
1523 *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
1524 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
1525 *link-name* ``STT_OBJECT`` - SHN_AMDGPU_LDS Global variable in LDS
1526 ===================== ================== ================ ==================
1529 Global variables both used and defined by the compilation unit.
1531 If the symbol is defined in the compilation unit then it is allocated in the
1532 appropriate section according to whether it has initialized data or is read-only.
1534 If the symbol is external then its section is ``SHN_UNDEF`` and the loader
1535 will resolve relocations using the definition provided by another code object
1536 or explicitly defined by the runtime.
1538 If the symbol resides in local/group memory (LDS) then its section is the
1539 special processor specific section name ``SHN_AMDGPU_LDS``, and the
1540 ``st_value`` field describes alignment requirements as it does for common
1545 Add description of linked shared object symbols. Seems undefined symbols
1546 are marked as STT_NOTYPE.
1549 Every HSA kernel has an associated kernel descriptor. It is the address of the
1550 kernel descriptor that is used in the AQL dispatch packet used to invoke the
1551 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1552 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
1555 Every HSA kernel also has a symbol for its machine code entry point.
1557 .. _amdgpu-relocation-records:
1562 The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
1563 relocatable fields are:
1566 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
1567 alignment. These values use the same byte order as other word values in the
1568 AMDGPU architecture.
1571 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
1572 alignment. These values use the same byte order as other word values in the
1573 AMDGPU architecture.
1575 The following notations are used for specifying relocation calculations:
1578 Represents the addend used to compute the value of the relocatable field.
1581 Represents the offset into the global offset table at which the relocation
1582 entry's symbol will reside during execution.
1585 Represents the address of the global offset table.
1588 Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
1589 of the storage unit being relocated (computed using ``r_offset``).
1592 Represents the value of the symbol whose index resides in the relocation
1593 entry. Relocations not using this must specify a symbol index of
1597 Represents the base address of a loaded executable or shared object which is
1598 the difference between the ELF address and the actual load address.
1599 Relocations using this are only valid in executable or shared objects.
1601 The following relocation types are supported:
1603 .. table:: AMDGPU ELF Relocation Records
1604 :name: amdgpu-elf-relocation-records-table
1606 ========================== ======= ===== ========== ==============================
1607 Relocation Type Kind Value Field Calculation
1608 ========================== ======= ===== ========== ==============================
1609 ``R_AMDGPU_NONE`` 0 *none* *none*
1610 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF
1612 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32
1614 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A
1616 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
1617 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
1618 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A
1620 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
1621 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
1622 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
1623 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
1624 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
1626 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
1627 ``R_AMDGPU_REL16`` Static 14 ``word16`` ((S + A - P) - 4) / 4
1628 ========================== ======= ===== ========== ==============================
1630 ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
1631 the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
1633 There is no current OS loader support for 32-bit programs and so
1634 ``R_AMDGPU_ABS32`` is not used.
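The calculations in the relocation table can be checked with a short Python sketch. It evaluates a few of the formulas using the notation defined above (``S``, ``A``, ``P``, ``B``) and truncates each result to the width of the relocated field. The function ``reloc_value`` is hypothetical, covers only a subset of the relocation types, and assumes that the ``R_AMDGPU_REL16`` byte distance is a multiple of 4.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def reloc_value(rtype: str, S: int = 0, A: int = 0, P: int = 0, B: int = 0) -> int:
    """Evaluate a relocation formula; the result is the raw field contents."""
    if rtype == "R_AMDGPU_ABS32_LO":
        return (S + A) & MASK32
    if rtype == "R_AMDGPU_ABS32_HI":
        return ((S + A) >> 32) & MASK32
    if rtype == "R_AMDGPU_ABS64":
        return (S + A) & MASK64
    if rtype == "R_AMDGPU_REL32":
        return (S + A - P) & MASK32  # two's complement for negative offsets
    if rtype == "R_AMDGPU_REL64":
        return (S + A - P) & MASK64
    if rtype == "R_AMDGPU_RELATIVE64":
        return (B + A) & MASK64
    if rtype == "R_AMDGPU_REL16":
        # Assumption: (S + A - P) - 4 is a multiple of 4, so // is exact.
        return (((S + A - P) - 4) // 4) & MASK16
    raise ValueError(f"unsupported relocation type: {rtype}")
```

Masking after the arithmetic relies on Python integers being unbounded, so a negative PC-relative distance ends up as the expected two's complement bit pattern.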
1636 .. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
1638 Loaded Code Object Path Uniform Resource Identifier (URI)
1639 ---------------------------------------------------------
1641 The AMD GPU code object loader represents the path of the ELF shared object from
1642 which the code object was loaded as a textual Uniform Resource Identifier (URI).
1643 Note that the code object is the in-memory, loaded, relocated form of the ELF
1644 shared object. Multiple code objects may be loaded at different memory
1645 addresses in the same process from the same ELF shared object.
1647 The loaded code object path URI syntax is defined by the following BNF syntax:
1651 code_object_uri ::== file_uri | memory_uri
1652 file_uri ::== "file://" file_path [ range_specifier ]
1653 memory_uri ::== "memory://" process_id range_specifier
1654 range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
1655 file_path ::== URI_ENCODED_OS_FILE_PATH
1656 process_id ::== DECIMAL_NUMBER
1657 number ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
1660 Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
1661 and octal values by "0".
1664 Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
1665 every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
1666 encoded as two uppercase hexadecimal digits preceded by "%". Directories in
1667 the path are separated by "/".
1670 Is a 0-based byte offset to the start of the code object. For a file URI, it
1671 is from the start of the file specified by the ``file_path``, and if omitted
1672 defaults to 0. For a memory URI, it is the memory address and is required.
1675 Is the number of bytes in the code object. For a file URI, if omitted it
1676 defaults to the size of the file. It is required for a memory URI.
1679 Is the identity of the process owning the memory. For Linux it is the C
1680 unsigned integral decimal literal for the process ID (PID).
1686 file:///dir1/dir2/file1
1687 file:///dir3/dir4/file2#offset=0x2000&size=3000
1688 memory://1234#offset=0x20000&size=3000
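A consumer of these URIs could parse them along the following lines. This Python sketch is one possible reading of the grammar above, not the loader's implementation: it expects the range specifier, when present, to be introduced by "#" or "?", and it returns ``None`` for a file size that was omitted because the real default (the size of the file) is not known without opening it. The function names are hypothetical.

```python
import re
from urllib.parse import unquote

# Range specifier at the end of the URI: "#" or "?" then offset and size.
_RANGE = re.compile(r"[#?]offset=([0-9A-Za-z]+)&size=([0-9A-Za-z]+)$")


def parse_number(text: str) -> int:
    """Parse a C integral literal: 0x/0X hex, leading-0 octal, else decimal."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)
    return int(text, 10)


def parse_code_object_uri(uri: str) -> dict:
    """Split a loaded code object URI into its components."""
    offset = size = None
    match = _RANGE.search(uri)
    if match:
        offset = parse_number(match.group(1))
        size = parse_number(match.group(2))
        uri = uri[:match.start()]
    if uri.startswith("file://"):
        path = unquote(uri[len("file://"):])  # undo the "%XX" URI encoding
        return {"kind": "file", "path": path,
                "offset": 0 if offset is None else offset, "size": size}
    if uri.startswith("memory://"):
        if match is None:
            raise ValueError("memory URI requires offset and size")
        pid = int(uri[len("memory://"):], 10)
        return {"kind": "memory", "process_id": pid,
                "offset": offset, "size": size}
    raise ValueError("not a code object URI")
```

Applied to the examples above, the second file URI yields offset 8192 and size 3000, and the memory URI yields process ID 1234 with offset 131072.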
1690 .. _amdgpu-dwarf-debug-information:
1692 DWARF Debug Information
1693 =======================
1697 This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
1698 is not currently fully implemented and is subject to change.
1700 AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
1701 :ref:`amdgpu-elf-code-object`) which contain information that maps the code
1702 object executable code and data to the source language constructs. It can be
1703 used by tools such as debuggers and profilers. It uses features defined in
1704 :doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
1705 DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
1707 This section defines the AMDGPU target architecture specific DWARF mappings.
1709 .. _amdgpu-dwarf-register-identifier:
1714 This section defines the AMDGPU target architecture register numbers used in
1715 DWARF operation expressions (see DWARF Version 5 section 2.5 and
1716 :ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
1717 instructions (see DWARF Version 5 section 6.4 and
1718 :ref:`amdgpu-dwarf-call-frame-information`).
1720 A single code object can contain code for kernels that have different wavefront
1721 sizes. The vector registers and some scalar registers are based on the wavefront
1722 size. AMDGPU defines distinct DWARF registers for each wavefront size. This
1723 simplifies the consumer of the DWARF so that each register has a fixed size,
1724 rather than being dynamic according to the wavefront size mode. Similarly,
1725 distinct DWARF registers are defined for those registers that vary in size
1726 according to the process address size. This allows a consumer to treat a
1727 specific AMDGPU processor as a single architecture regardless of how it is
1728 configured at run time. The compiler explicitly specifies the DWARF registers
1729 that match the mode in which the code it is generating will be executed.
1731 DWARF registers are encoded as numbers, which are mapped to architecture
1732 registers. The mapping for AMDGPU is defined in
1733 :ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same

.. table:: AMDGPU DWARF Register Mapping
   :name: amdgpu-dwarf-register-mapping-table

   ============== ================= ======== ==================================
   DWARF Register AMDGPU Register   Bit Size Description
   ============== ================= ======== ==================================
   0              PC_32             32       Program Counter (PC) when
                                             executing in a 32-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   1              EXEC_MASK_32      32       Execution Mask Register when
                                             executing in wavefront 32 mode.
   2-15           *Reserved*                 *Reserved for highly accessed
                                             registers using DWARF shortcut.*
   16             PC_64             64       Program Counter (PC) when
                                             executing in a 64-bit process
                                             address space. Used in the CFI to
                                             describe the PC of the calling
                                             frame.
   17             EXEC_MASK_64      64       Execution Mask Register when
                                             executing in wavefront 64 mode.
   18-31          *Reserved*                 *Reserved for highly accessed
                                             registers using DWARF shortcut.*
   32-95          SGPR0-SGPR63      32       Scalar General Purpose
                                             Registers.
   96-127         *Reserved*                 *Reserved for frequently accessed
                                             registers using DWARF 1-byte ULEB.*
   128            STATUS            32       Status Register.
   129-511        *Reserved*                 *Reserved for future Scalar
                                             Architectural Registers.*
   512            VCC_32            32       Vector Condition Code Register
                                             when executing in wavefront 32
                                             mode.
   513-767        *Reserved*                 *Reserved for future Vector
                                             Architectural Registers when
                                             executing in wavefront 32 mode.*
   768            VCC_64            64       Vector Condition Code Register
                                             when executing in wavefront 64
                                             mode.
   769-1023       *Reserved*                 *Reserved for future Vector
                                             Architectural Registers when
                                             executing in wavefront 64 mode.*
   1024-1087      *Reserved*                 *Reserved for padding.*
   1088-1129      SGPR64-SGPR105    32       Scalar General Purpose Registers.
   1130-1535      *Reserved*                 *Reserved for future Scalar
                                             General Purpose Registers.*
   1536-1791      VGPR0-VGPR255     32*32    Vector General Purpose Registers
                                             when executing in wavefront 32
                                             mode.
   1792-2047      *Reserved*                 *Reserved for future Vector
                                             General Purpose Registers when
                                             executing in wavefront 32 mode.*
   2048-2303      AGPR0-AGPR255     32*32    Vector Accumulation Registers
                                             when executing in wavefront 32
                                             mode.
   2304-2559      *Reserved*                 *Reserved for future Vector
                                             Accumulation Registers when
                                             executing in wavefront 32 mode.*
   2560-2815      VGPR0-VGPR255     64*32    Vector General Purpose Registers
                                             when executing in wavefront 64
                                             mode.
   2816-3071      *Reserved*                 *Reserved for future Vector
                                             General Purpose Registers when
                                             executing in wavefront 64 mode.*
   3072-3327      AGPR0-AGPR255     64*32    Vector Accumulation Registers
                                             when executing in wavefront 64
                                             mode.
   3328-3583      *Reserved*                 *Reserved for future Vector
                                             Accumulation Registers when
                                             executing in wavefront 64 mode.*
   ============== ================= ======== ==================================

The vector registers are represented as the full size for the wavefront. They
are organized as consecutive dwords (32 bits), one per lane, with the dword at
the least significant bit position corresponding to lane 0 and so forth. DWARF
location expressions involving the ``DW_OP_LLVM_offset`` and
``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
register corresponding to the lane that is executing the current thread of
execution in languages that are implemented using a SIMD or SIMT execution
model.
1818 If the wavefront size is 32 lanes then the wavefront 32 mode register
1819 definitions are used. If the wavefront size is 64 lanes then the wavefront 64
1820 mode register definitions are used. Some AMDGPU targets support executing in
1821 both wavefront 32 and wavefront 64 mode. The register definitions corresponding
1822 to the wavefront mode of the generated code will be used.
1824 If code is generated to execute in a 32-bit process address space, then the
1825 32-bit process address space register definitions are used. If code is generated
1826 to execute in a 64-bit process address space, then the 64-bit process address
1827 space register definitions are used. The ``amdgcn`` target only supports the
1828 64-bit process address space.
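
The register numbering above is regular enough to compute. The following Python
sketch is illustrative only: the function names are hypothetical and are not
part of LLVM, and the base numbers are copied from
:ref:`amdgpu-dwarf-register-mapping-table`. It also shows how the bit offset of
a lane within a vector register follows from the one-dword-per-lane layout.

.. code-block:: python

  # Base DWARF register numbers for the vector register files, keyed by
  # (register kind, wavefront size). Values are taken from the mapping table.
  VECTOR_BASES = {
      ("VGPR", 32): 1536,
      ("AGPR", 32): 2048,
      ("VGPR", 64): 2560,
      ("AGPR", 64): 3072,
  }


  def scalar_dwarf_register(index: int) -> int:
      """Return the DWARF register number of SGPR<index>."""
      if 0 <= index <= 63:
          return 32 + index            # SGPR0-SGPR63 map to 32-95
      if 64 <= index <= 105:
          return 1088 + (index - 64)   # SGPR64-SGPR105 map to 1088-1129
      raise ValueError(f"SGPR{index} has no DWARF register number")


  def vector_dwarf_register(kind: str, index: int, wavefront_size: int) -> int:
      """Return the DWARF register number of VGPR<index> or AGPR<index>."""
      if not 0 <= index <= 255:
          raise ValueError(f"{kind}{index} is out of range")
      try:
          base = VECTOR_BASES[(kind, wavefront_size)]
      except KeyError:
          raise ValueError(f"unsupported {kind} / wavefront {wavefront_size}") from None
      return base + index


  def lane_bit_offset(lane: int, wavefront_size: int) -> int:
      """Return the bit offset of a lane's dword within a vector register."""
      if not 0 <= lane < wavefront_size:
          raise ValueError(f"lane {lane} is out of range")
      return lane * 32  # one 32-bit dword per lane, lane 0 least significant

For instance, ``vector_dwarf_register("VGPR", 5, 64)`` selects the wavefront 64
definition of ``VGPR5``, while the same register in wavefront 32 code uses a
different DWARF number, which is the distinction the section describes.
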
1830 .. _amdgpu-dwarf-address-class-identifier:
1832 Address Class Identifier
1833 ------------------------
1835 The DWARF address class represents the source language memory space. See DWARF
1836 Version 5 section 2.12 which is updated by the *DWARF Extensions For
1837 Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.
1839 The DWARF address class mapping used for AMDGPU is defined in
1840 :ref:`amdgpu-dwarf-address-class-mapping-table`.

.. table:: AMDGPU DWARF Address Class Mapping
   :name: amdgpu-dwarf-address-class-mapping-table

   ========================= ====== =================
   DWARF                            AMDGPU
   -------------------------------- -----------------
   Address Class Name        Value  Address Space
   ========================= ====== =================
   ``DW_ADDR_none``          0x0000 Generic (Flat)
   ``DW_ADDR_LLVM_global``   0x0001 Global
   ``DW_ADDR_LLVM_constant`` 0x0002 Global
   ``DW_ADDR_LLVM_group``    0x0003 Local (group/LDS)
   ``DW_ADDR_LLVM_private``  0x0004 Private (Scratch)
   ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS)
   ========================= ====== =================

1858 The DWARF address class values defined in the *DWARF Extensions For
1859 Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses` are used.
1861 In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. This is
1862 available for use for the AMD extension for access to the hardware GDS memory
1863 which is scratchpad memory allocated per device.
1865 For AMDGPU if no ``DW_AT_address_class`` attribute is present, then the default
1866 address class of ``DW_ADDR_none`` is used.
See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
mapping of DWARF address classes to DWARF address spaces, including address size
and NULL value.
1872 .. _amdgpu-dwarf-address-space-identifier:
1874 Address Space Identifier
1875 ------------------------
1877 DWARF address spaces correspond to target architecture specific linear
1878 addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
1879 For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.
1881 The DWARF address space mapping used for AMDGPU is defined in
1882 :ref:`amdgpu-dwarf-address-space-mapping-table`.

.. table:: AMDGPU DWARF Address Space Mapping
   :name: amdgpu-dwarf-address-space-mapping-table

   ======================================= ===== ======= ======== ================= =======================
   DWARF                                                          AMDGPU            Notes
   --------------------------------------- ----- ---------------- ----------------- -----------------------
   Address Space Name                      Value Address Bit Size Address Space
   --------------------------------------- ----- ------- -------- ----------------- -----------------------
   ..                                            64-bit  32-bit
                                                 process process
                                                 address address
                                                 space   space
   ======================================= ===== ======= ======== ================= =======================
   ``DW_ASPACE_none``                      0x00  64      32       Global            *default address space*
   ``DW_ASPACE_AMDGPU_generic``            0x01  64      32       Generic (Flat)
   ``DW_ASPACE_AMDGPU_region``             0x02  32      32       Region (GDS)
   ``DW_ASPACE_AMDGPU_local``              0x03  32      32       Local (group/LDS)
   ``DW_ASPACE_AMDGPU_private_lane``       0x05  32      32       Private (Scratch) *focused lane*
   ``DW_ASPACE_AMDGPU_private_wave``       0x06  32      32       Private (Scratch) *unswizzled wavefront*
   ======================================= ===== ======= ======== ================= =======================

1906 See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces
1907 including address size and NULL value.
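
A consumer typically needs the address size of a DWARF address space before it
can read or convert an address. The following Python sketch encodes the rows of
:ref:`amdgpu-dwarf-address-space-mapping-table` as a lookup table. It assumes
that the two address bit size columns apply to 64-bit and 32-bit process address
spaces respectively. The names ``ADDRESS_SPACES`` and ``address_bit_size`` are
hypothetical.

.. code-block:: python

  # value -> (DWARF address space name, bits in 64-bit process, bits in 32-bit process)
  ADDRESS_SPACES = {
      0x00: ("DW_ASPACE_none", 64, 32),
      0x01: ("DW_ASPACE_AMDGPU_generic", 64, 32),
      0x02: ("DW_ASPACE_AMDGPU_region", 32, 32),
      0x03: ("DW_ASPACE_AMDGPU_local", 32, 32),
      0x05: ("DW_ASPACE_AMDGPU_private_lane", 32, 32),
      0x06: ("DW_ASPACE_AMDGPU_private_wave", 32, 32),
  }


  def address_bit_size(aspace: int, process_bits: int = 64) -> int:
      """Return the address size in bits for a DWARF address space value."""
      if process_bits not in (32, 64):
          raise ValueError("process address space must be 32 or 64 bits")
      try:
          _name, bits64, bits32 = ADDRESS_SPACES[aspace]
      except KeyError:
          raise ValueError(f"unknown DWARF address space {aspace:#x}") from None
      return bits64 if process_bits == 64 else bits32

The default of 64 bits reflects the statement above that the ``amdgcn`` target
only supports the 64-bit process address space.
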
1909 The ``DW_ASPACE_none`` address space is the default target architecture address
1910 space used in DWARF operations that do not specify an address space. It
1911 therefore has to map to the global address space so that the ``DW_OP_addr*`` and
1912 related operations can refer to addresses in the program code.
1914 The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
1915 specify the flat address space. If the address corresponds to an address in the
1916 local address space, then it corresponds to the wavefront that is executing the
1917 focused thread of execution. If the address corresponds to an address in the
1918 private address space, then it corresponds to the lane that is executing the
1919 focused thread of execution for languages that are implemented using a SIMD or
1920 SIMT execution model.
1924 CUDA-like languages such as HIP that do not have address spaces in the
1925 language type system, but do allow variables to be allocated in different
1926 address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
1927 address space in the DWARF expression operations as the default address space
1928 is the global address space.
1930 The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
1931 specify the local address space corresponding to the wavefront that is executing
1932 the focused thread of execution.
1934 The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
1935 to specify the private address space corresponding to the lane that is executing
1936 the focused thread of execution for languages that are implemented using a SIMD
1937 or SIMT execution model.
1939 The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
1940 to specify the unswizzled private address space corresponding to the wavefront
1941 that is executing the focused thread of execution. The wavefront view of private
1942 memory is the per wavefront unswizzled backing memory layout defined in
1943 :ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
1944 location for the backing memory of the wavefront (namely the address is not
1945 offset by ``wavefront-scratch-base``). The following formula can be used to
1946 convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
1947 ``DW_ASPACE_AMDGPU_private_wave`` address:

::

  private-address-wavefront =
    ((private-address-lane / 4) * wavefront-size * 4) +
    (wavefront-lane-id * 4) + (private-address-lane % 4)

If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
of the dwords for each lane starting with lane 0 is required, then this
simplifies to:

::

  private-address-wavefront =
    private-address-lane * wavefront-size

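
The two formulas above can be checked with a short Python sketch. The function
names are hypothetical. The division in the formula is taken to be integer
division, which is an assumption consistent with the remainder term
``private-address-lane % 4``.

.. code-block:: python

  def private_lane_to_wave(lane_address: int, lane_id: int, wavefront_size: int) -> int:
      """Convert a private lane address to an unswizzled private wave address."""
      if wavefront_size not in (32, 64):
          raise ValueError("wavefront size must be 32 or 64")
      if not 0 <= lane_id < wavefront_size:
          raise ValueError("lane id is outside the wavefront")
      # Split the lane address into a dword index and a byte within that dword.
      dword_index, byte_in_dword = divmod(lane_address, 4)
      # Each dword of lane memory occupies wavefront_size dwords of backing memory.
      return dword_index * wavefront_size * 4 + lane_id * 4 + byte_in_dword


  def private_lane_to_wave_base(lane_address: int, wavefront_size: int) -> int:
      """Simplified form: dword aligned address, start of the lane 0 dword."""
      if lane_address % 4 != 0:
          raise ValueError("lane address must be dword aligned")
      return lane_address * wavefront_size

For a dword aligned address and lane 0 the two functions agree, which is the
simplification the text describes. For example, lane address 8 in a 64 lane
wavefront gives ``(8 / 4) * 64 * 4 = 512`` in the general form and
``8 * 64 = 512`` in the simplified form.
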
1964 A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
1965 complete spilled vector register back into a complete vector register in the
1966 CFI. The frame pointer can be a private lane address which is dword aligned,
1967 which can be shifted to multiply by the wavefront size, and then used to form a
1968 private wavefront address that gives a location for a contiguous set of dwords,
1969 one per lane, where the vector register dwords are spilled. The compiler knows
1970 the wavefront size since it generates the code. Note that the type of the
1971 address may have to be converted as the size of a
1972 ``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
1973 ``DW_ASPACE_AMDGPU_private_wave`` address.
1975 .. _amdgpu-dwarf-lane-identifier:

Lane Identifier
---------------

DWARF lane identifiers specify a target architecture lane position for hardware
1981 that executes in a SIMD or SIMT manner, and on which a source language maps its
1982 threads of execution onto those lanes. The DWARF lane identifier is pushed by
1983 the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
1984 section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
1985 section :ref:`amdgpu-dwarf-operation-expressions`.
1987 For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
1988 wavefront. It is numbered from 0 to the wavefront size minus 1.
1990 Operation Expressions
1991 ---------------------
1993 DWARF expressions are used to compute program values and the locations of
1994 program objects. See DWARF Version 5 section 2.5 and
1995 :ref:`amdgpu-dwarf-operation-expressions`.
1997 DWARF location descriptions describe how to access storage which includes memory
1998 and registers. When accessing storage on AMDGPU, bytes are ordered with least
1999 significant bytes first, and bits are ordered within bytes with least
2000 significant bits first.
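
The byte and bit ordering described above means that a bit offset into storage
can be resolved by treating the bytes as one little-endian integer. The
following Python sketch illustrates this. ``read_bits`` is a hypothetical
helper, not an LLVM or debugger API.

.. code-block:: python

  def read_bits(storage: bytes, bit_offset: int, bit_size: int) -> int:
      """Read bit_size bits starting at bit_offset from little-endian storage."""
      if bit_offset < 0 or bit_size < 0 or bit_offset + bit_size > len(storage) * 8:
          raise ValueError("bit range is outside the storage")
      # Least significant byte first, and least significant bit first in a byte.
      value = int.from_bytes(storage, "little")
      return (value >> bit_offset) & ((1 << bit_size) - 1)

With this ordering, bit 8 of the storage is the least significant bit of the
second byte.
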
2002 For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
2003 unwinding vector registers that are spilled under the execution mask to memory:
2004 the zero-single location description is the vector register, and the one-single
2005 location description is the spilled memory location description. The
2006 ``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
2007 memory location description.
2009 In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
2010 ``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
2011 controlled by the execution mask. An undefined location description together
2012 with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
2013 to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
2015 Debugger Information Entry Attributes
2016 -------------------------------------
This section describes how certain debugger information entry attributes are
used by AMDGPU. See DWARF Version 5 sections 3.3.5 and 3.1.1, which are updated
by *DWARF Extensions For Heterogeneous Debugging* sections
:ref:`amdgpu-dwarf-low-level-information` and
:ref:`amdgpu-dwarf-full-and-partial-compilation-unit-entries`.
2024 .. _amdgpu-dwarf-dw-at-llvm-lane-pc:
2026 ``DW_AT_LLVM_lane_pc``
2027 ~~~~~~~~~~~~~~~~~~~~~~
2029 For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
2030 location of the separate lanes of a SIMT thread.
If the lane is an active lane then this will be the same as the current program
location.

If the lane is inactive, but was active on entry to the subprogram, then this is
the program location in the subprogram at which execution of the lane is
conceptually positioned.
2039 If the lane was not active on entry to the subprogram, then this will be the
2040 undefined location. A client debugger can check if the lane is part of a valid
2041 work-group by checking that the lane is in the range of the associated
2042 work-group within the grid, accounting for partial work-groups. If it is not,
2043 then the debugger can omit any information for the lane. Otherwise, the debugger
2044 may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
2045 calling subprogram until it finds a non-undefined location. Conceptually the
2046 lane only has the call frames that it has a non-undefined
2047 ``DW_AT_LLVM_lane_pc``.
2049 The following example illustrates how the AMDGPU backend can generate a DWARF
2050 location list expression for the nested ``IF/THEN/ELSE`` structures of the
2051 following subprogram pseudo code for a target with 64 lanes per wavefront.
The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
execution mask (``EXEC``) to linearize the control flow. The condition is
evaluated to make a mask of the lanes for which the condition evaluates to true.
First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
``ELSE`` region is executed by setting the ``EXEC`` mask to the logical ``AND``
of the negated ``EXEC`` mask with the ``EXEC`` mask saved at the start of the
region. After the ``IF/THEN/ELSE`` region the ``EXEC`` mask is restored to the
value it had at the beginning of the region. This is shown below. Other
approaches are possible, but the basic concept is the same.
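
The mask arithmetic described in this paragraph can also be modeled outside of
MIR. The following Python sketch is an illustration under the assumption of a
64 lane wavefront. The function name is hypothetical, and the sketch models only
the ``EXEC`` values, not the instructions a compiler would emit.

.. code-block:: python

  WAVE_MASK = (1 << 64) - 1  # one bit per lane of a 64 lane wavefront


  def if_then_else_masks(exec_mask: int, condition_mask: int):
      """Return the EXEC values used for THEN, for ELSE, and after the region."""
      saved_exec = exec_mask
      # THEN: lanes that were active and for which the condition is true.
      then_exec = saved_exec & condition_mask
      # ELSE: negate the THEN mask and AND it with the mask saved on entry.
      else_exec = ~then_exec & saved_exec & WAVE_MASK
      # After the region the mask saved on entry is restored.
      restored_exec = saved_exec
      return then_exec, else_exec, restored_exec

Every lane that was active on entry executes exactly one of the two regions,
and lanes that were inactive on entry stay inactive throughout.
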
2115 To create the DWARF location list expression that defines the location
2116 description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
2117 pseudo instruction can be used to annotate the linearized control flow. This can
2118 be done by defining an artificial variable for the lane PC. The DWARF location
2119 list expression created for it is used as the value of the
2120 ``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
2122 A DWARF procedure is defined for each well nested structured control flow region
2123 which provides the conceptual lane program location for a lane if it is not
2124 active (namely it is divergent). The DWARF operation expression for each region
2125 conceptually inherits the value of the immediately enclosing region and modifies
2126 it according to the semantics of the region.
2128 For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
2129 the region for the ``THEN`` region since it is executed first. For the ``ELSE``
2130 region the divergent program location is at the end of the ``IF/THEN/ELSE``
2131 region since the ``THEN`` region has completed.
2133 The lane PC artificial variable is assigned at each region transition. It uses
2134 the immediately enclosing region's DWARF procedure to compute the program
2135 location for each lane assuming they are divergent, and then modifies the result
2136 by inserting the current program location for each lane that the ``EXEC`` mask
2137 indicates is active.
2139 By having separate DWARF procedures for each region, they can be reused to
2140 define the value for any nested region. This reduces the total size of the DWARF
2141 operation expressions.
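
The layering described above can be modeled with plain lists. The following
Python sketch is a conceptual model only: it is not how DWARF expressions are
evaluated by any particular debugger, all names are hypothetical, and ``None``
stands in for the undefined location. ``select_bit_piece`` takes the second
source for lanes whose mask bit is 1 and the first source otherwise.

.. code-block:: python

  from typing import List, Optional

  LanePCs = List[Optional[int]]  # one entry per lane, None means undefined


  def select_bit_piece(zero_src: LanePCs, one_src: LanePCs, mask: int) -> LanePCs:
      """Per lane, take one_src where the mask bit is 1, otherwise zero_src."""
      return [
          one if (mask >> lane) & 1 else zero
          for lane, (zero, one) in enumerate(zip(zero_src, one_src))
      ]


  def region_divergent_pcs(enclosing: LanePCs, region_pc: int, saved_exec: int) -> LanePCs:
      """Lanes active on region entry get region_pc, others inherit."""
      return select_bit_piece(enclosing, [region_pc] * len(enclosing), saved_exec)


  def lane_pcs(divergent: LanePCs, current_pc: int, exec_mask: int) -> LanePCs:
      """Active lanes report the current PC, others keep the divergent PC."""
      return select_bit_piece(divergent, [current_pc] * len(divergent), exec_mask)

As a small example with 4 lanes, take lane 3 as inactive on entry to the
subprogram, so the saved mask on entry to the region is ``0b0111``, and take
``0b0101`` as the ``EXEC`` mask inside the ``THEN`` region. Starting from four
undefined entries, ``region_divergent_pcs`` places lanes 0 to 2 at the region
start, and ``lane_pcs`` then overrides lanes 0 and 2 with the current program
location. Lane 1 stays at the region start and lane 3 stays undefined.
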
2143 The following provides an example using pseudo LLVM MIR.
2149 DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
2150 DW_AT_name = "__uint64";
2151 DW_AT_byte_size = 8;
2152 DW_AT_encoding = DW_ATE_unsigned;
2154 DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
2155 DW_AT_name = "__active_lane_pc";
2158 DW_OP_LLVM_extend 64, 64;
DW_OP_regval_type EXEC, %__uint_64;
2160 DW_OP_LLVM_select_bit_piece 64, 64;
2163 DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
2164 DW_AT_name = "__divergent_lane_pc";
2166 DW_OP_LLVM_undefined;
2167 DW_OP_LLVM_extend 64, 64;
2170 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2171 DW_OP_call_ref %__divergent_lane_pc;
2172 DW_OP_call_ref %__active_lane_pc;
2176 DBG_VALUE %1, $noreg, %__lex_1_save_exec;
2181 DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
2182 DW_AT_name = "__divergent_lane_pc_1_then";
2183 DW_AT_location = DIExpression[
2184 DW_OP_call_ref %__divergent_lane_pc;
2185 DW_OP_addrx &lex_1_start;
2187 DW_OP_LLVM_extend 64, 64;
2188 DW_OP_call_ref %__lex_1_save_exec;
2189 DW_OP_deref_type 64, %__uint_64;
2190 DW_OP_LLVM_select_bit_piece 64, 64;
2193 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2194 DW_OP_call_ref %__divergent_lane_pc_1_then;
2195 DW_OP_call_ref %__active_lane_pc;
DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
2204 DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
2205 DW_AT_name = "__divergent_lane_pc_1_1_then";
2206 DW_AT_location = DIExpression[
2207 DW_OP_call_ref %__divergent_lane_pc_1_then;
2208 DW_OP_addrx &lex_1_1_start;
2210 DW_OP_LLVM_extend 64, 64;
2211 DW_OP_call_ref %__lex_1_1_save_exec;
2212 DW_OP_deref_type 64, %__uint_64;
2213 DW_OP_LLVM_select_bit_piece 64, 64;
2216 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2217 DW_OP_call_ref %__divergent_lane_pc_1_1_then;
2218 DW_OP_call_ref %__active_lane_pc;
2223 DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
2224 DW_AT_name = "__divergent_lane_pc_1_1_else";
2225 DW_AT_location = DIExpression[
2226 DW_OP_call_ref %__divergent_lane_pc_1_then;
2227 DW_OP_addrx &lex_1_1_end;
2229 DW_OP_LLVM_extend 64, 64;
2230 DW_OP_call_ref %__lex_1_1_save_exec;
2231 DW_OP_deref_type 64, %__uint_64;
2232 DW_OP_LLVM_select_bit_piece 64, 64;
2235 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2236 DW_OP_call_ref %__divergent_lane_pc_1_1_else;
2237 DW_OP_call_ref %__active_lane_pc;
2242 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2243 DW_OP_call_ref %__divergent_lane_pc;
2244 DW_OP_call_ref %__active_lane_pc;
2249 DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
2250 DW_AT_name = "__divergent_lane_pc_1_else";
2251 DW_AT_location = DIExpression[
2252 DW_OP_call_ref %__divergent_lane_pc;
2253 DW_OP_addrx &lex_1_end;
2255 DW_OP_LLVM_extend 64, 64;
2256 DW_OP_call_ref %__lex_1_save_exec;
2257 DW_OP_deref_type 64, %__uint_64;
2258 DW_OP_LLVM_select_bit_piece 64, 64;
2261 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2262 DW_OP_call_ref %__divergent_lane_pc_1_else;
2263 DW_OP_call_ref %__active_lane_pc;
DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2269 DW_OP_call_ref %__divergent_lane_pc;
2270 DW_OP_call_ref %__active_lane_pc;
The DWARF procedure ``%__active_lane_pc`` is used to update the lane PC elements
that are active with the current program location.

Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo instruction, location list entries will be created that
describe where the artificial variables are allocated at any given program
location. The compiler may allocate them to registers or spill them to memory.

The DWARF procedures for each region use the values of the saved execution mask
artificial variables to only update the lanes that are active on entry to the
region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.

Other structured control flow regions can be handled similarly. For example,
loops would set the divergent program location for the region at the end of the
loop. Any lanes active will be in the loop, and any lanes not active must have
exited the loop.

2295 An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
2296 ``IF/THEN/ELSE`` regions.
2298 The DWARF procedures can use the active lane artificial variable described in
2299 :ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
2300 ``EXEC`` mask in order to support whole or quad wavefront mode.
2302 .. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
2304 ``DW_AT_LLVM_active_lane``
2305 ~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
entry is used to specify the lanes that are conceptually active for a SIMT
thread.

The execution mask may be modified to implement whole or quad wavefront mode
operations. For example, all lanes may need to temporarily be made active to
execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
update it to enable the necessary lanes, perform the operations, and then
restore the ``EXEC`` mask from the saved value. While executing the whole
wavefront region, the conceptual execution mask is the saved value, not the
current value.

This is handled by defining an artificial variable for the active lane mask. The
active lane mask artificial variable would be the actual ``EXEC`` mask for
normal regions, and the saved execution mask for regions where the mask is
temporarily updated. The location list expression created for this artificial
variable is used to define the value of the ``DW_AT_LLVM_active_lane``
attribute.

2326 ``DW_AT_LLVM_augmentation``
2327 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
2329 For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
2330 debugger information entry has the following value for the augmentation string:
2336 The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2337 extensions used in the DWARF of the compilation unit. The version number
2338 conforms to [SEMVER]_.
2340 Call Frame Information
2341 ----------------------
2343 DWARF Call Frame Information (CFI) describes how a consumer can virtually
2344 *unwind* call frames in a running process or core dump. See DWARF Version 5
2345 section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
2347 For AMDGPU, the Common Information Entry (CIE) fields have the following values:
2349 1. ``augmentation`` string contains the following null-terminated UTF-8 string:
   The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
   extensions used in this CIE or in the FDEs that use it. The version number
   conforms to [SEMVER]_.
2359 2. ``address_size`` for the ``Global`` address space is defined in
2360 :ref:`amdgpu-dwarf-address-space-identifier`.
2362 3. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.
2364 4. ``code_alignment_factor`` is 4 bytes.
2368 Add to :ref:`amdgpu-processor-table` table.
2370 5. ``data_alignment_factor`` is 4 bytes.
2374 Add to :ref:`amdgpu-processor-table` table.
2376 6. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
2377 for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.
2379 7. ``initial_instructions`` Since a subprogram X with fewer registers can be
2380 called from subprogram Y that has more allocated, X will not change any of
2381 the extra registers as it cannot access them. Therefore, the default rule
2382 for all columns is ``same value``.
2384 For AMDGPU the register number follows the numbering defined in
2385 :ref:`amdgpu-dwarf-register-identifier`.
2387 For AMDGPU the instructions are variable size. A consumer can subtract 1 from
2388 the return address to get the address of a byte within the call site
2389 instructions. See DWARF Version 5 section 6.4.4.
2394 See DWARF Version 5 section 6.1.
2396 Lookup By Name Section Header
2397 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2399 See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
2401 For AMDGPU the lookup by name section header table:
2403 ``augmentation_string_size`` (uword)
  Set to the length of the ``augmentation_string`` value which is always a
  multiple of 4.

2408 ``augmentation_string`` (sequence of UTF-8 characters)
2410 Contains the following UTF-8 string null padded to a multiple of 4 bytes:
  The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
  extensions used in the DWARF of this index. The version number conforms to
  [SEMVER]_.

This differs from the DWARF Version 5 definition, which requires the first
4 characters to be the vendor ID. It is, however, consistent with the other
augmentation strings and does allow multiple vendor contributions. Backwards
compatibility may nevertheless be more desirable.
2427 Lookup By Address Section Header
2428 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2430 See DWARF Version 5 section 6.1.2.
2432 For AMDGPU the lookup by address section header table:
2434 ``address_size`` (ubyte)
2436 Match the address size for the ``Global`` address space defined in
2437 :ref:`amdgpu-dwarf-address-space-identifier`.
2439 ``segment_selector_size`` (ubyte)
2441 AMDGPU does not use a segment selector so this is 0. The entries in the
2442 ``.debug_aranges`` do not have a segment selector.
2444 Line Number Information
2445 -----------------------
2447 See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.
AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
The instruction set must be obtained from the ELF file header ``e_flags`` field
in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
<amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
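
Because the line table carries no ISA information, a consumer has to read it
from the ELF header. The following Python sketch shows one way to do that for a
64-bit ELF file. It assumes that ``EF_AMDGPU_MACH`` is the mask ``0x0ff`` over
``e_flags``, which should be checked against the ELF header section referenced
above. The function names are hypothetical, and the offset of ``e_flags`` (byte
48) comes from the generic ELF64 header layout.

.. code-block:: python

  import struct

  EF_AMDGPU_MACH = 0x0FF  # assumed mask, verify against the ELF header section


  def read_e_flags_elf64(header: bytes) -> int:
      """Extract e_flags from the first 64 bytes of an ELF64 file."""
      if len(header) < 64 or header[:4] != b"\x7fELF":
          raise ValueError("not an ELF header")
      if header[4] != 2:  # EI_CLASS must be ELFCLASS64
          raise ValueError("only ELF64 is handled in this sketch")
      byte_order = "<" if header[5] == 1 else ">"  # EI_DATA: 1 is little-endian
      (e_flags,) = struct.unpack_from(byte_order + "I", header, 48)
      return e_flags


  def amdgpu_mach(e_flags: int) -> int:
      """Return the machine field selected by EF_AMDGPU_MACH."""
      return e_flags & EF_AMDGPU_MACH

The returned value can then be compared against the ``EF_AMDGPU_MACH_*`` values
listed for the supported processors.
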
2456 Should the ``isa`` state machine register be used to indicate if the code is
2457 in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?
2459 For AMDGPU the line number program header fields have the following values (see
2460 DWARF Version 5 section 6.2.4):
2462 ``address_size`` (ubyte)
2463 Matches the address size for the ``Global`` address space defined in
2464 :ref:`amdgpu-dwarf-address-space-identifier`.
2466 ``segment_selector_size`` (ubyte)
2467 AMDGPU does not use a segment selector so this is 0.
2469 ``minimum_instruction_length`` (ubyte)
2470 For GFX9-GFX10 this is 4.
2472 ``maximum_operations_per_instruction`` (ubyte)
2473 For GFX9-GFX10 this is 1.
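
With ``maximum_operations_per_instruction`` equal to 1, the line number state
machine of DWARF Version 5 section 6.2.5.1 advances the address by
``minimum_instruction_length`` times the operation advance. The following
Python sketch shows that arithmetic for the GFX9-GFX10 values above. The
function name is hypothetical.

.. code-block:: python

  MINIMUM_INSTRUCTION_LENGTH = 4          # GFX9-GFX10, from the header fields above
  MAXIMUM_OPERATIONS_PER_INSTRUCTION = 1  # GFX9-GFX10, so op_index is always 0


  def advance_address(address: int, operation_advance: int) -> int:
      """Apply an operation advance to the line number state machine address."""
      if operation_advance < 0:
          raise ValueError("operation advance is unsigned")
      return address + MINIMUM_INSTRUCTION_LENGTH * operation_advance

An operation advance of 3 therefore moves the address forward by 12 bytes.
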
2475 Source text for online-compiled programs (for example, those compiled by the
2476 OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
2477 See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
2478 Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
2479 <amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.
2481 The Clang option used to control source embedding in AMDGPU is defined in
2482 :ref:`amdgpu-clang-debug-options-table`.

.. table:: AMDGPU Clang Debug Options
   :name: amdgpu-clang-debug-options-table

   ==================== ==================================================
   Debug Flag           Description
   ==================== ==================================================
   -g[no-]embed-source  Enable/disable embedding source text in DWARF
                        debug sections. Useful for environments where
                        source cannot be written to disk, such as
                        when performing online compilation.
   ==================== ==================================================

``-gembed-source``
  Enable the embedded source.

``-gno-embed-source``
  Disable the embedded source.

2504 32-Bit and 64-Bit DWARF Formats
2505 -------------------------------
2507 See DWARF Version 5 section 7.4 and
2508 :ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.
* For the ``amdgcn`` target architecture only the 64-bit process address space
  is supported.

2515 * The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
2516 the 32-bit DWARF format.
2521 For AMDGPU the following values apply for each of the unit headers described in
2522 DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:
2524 ``address_size`` (ubyte)
2525 Matches the address size for the ``Global`` address space defined in
2526 :ref:`amdgpu-dwarf-address-space-identifier`.
2528 .. _amdgpu-code-conventions:
2533 This section provides code conventions used for each supported target triple OS
2534 (see :ref:`amdgpu-target-triples`).
2539 This section provides code conventions used when the target triple OS is
2540 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
2542 .. _amdgpu-amdhsa-code-object-metadata:
2544 Code Object Metadata
2545 ~~~~~~~~~~~~~~~~~~~~
2547 The code object metadata specifies extensible metadata associated with the code
2548 objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
2549 encoding and semantics of this metadata depends on the code object version; see
2550 :ref:`amdgpu-amdhsa-code-object-metadata-v2`,
2551 :ref:`amdgpu-amdhsa-code-object-metadata-v3`, and
2552 :ref:`amdgpu-amdhsa-code-object-metadata-v4`.
2554 Code object metadata is specified in a note record (see
2555 :ref:`amdgpu-note-records`) and is required when the target triple OS is
2556 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
2557 information necessary to support the HSA compatible runtime kernel queries. For
2558 example, the segment sizes needed in a dispatch packet. In addition, a
2559 high-level language runtime may require other information to be included. For
2560 example, the AMD OpenCL runtime records kernel argument information.
2562 .. _amdgpu-amdhsa-code-object-metadata-v2:
2564 Code Object V2 Metadata
2565 +++++++++++++++++++++++
Code object V2 is not the default code object version emitted by this version
of LLVM.

2571 Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
2572 (see :ref:`amdgpu-note-records-v2`).
2574 The metadata is specified as a YAML formatted string (see [YAML]_ and
2579 Is the string null terminated? It probably should not if YAML allows it to
2580 contain null characters, otherwise it should be.
The metadata is represented as a single YAML document consisting of the mapping
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
referenced tables.

2586 For boolean values, the string values of ``false`` and ``true`` are used for
2587 false and true respectively.
2589 Additional information can be added to the mappings. To avoid conflicts, any
2590 non-AMD key names should be prefixed by "*vendor-name*.".
2592 .. table:: AMDHSA Code Object V2 Metadata Map
2593 :name: amdgpu-amdhsa-code-object-metadata-map-v2-table
2595 ========== ============== ========= =======================================
2596 String Key Value Type Required? Description
2597 ========== ============== ========= =======================================
2598 "Version" sequence of Required - The first integer is the major
2599 2 integers version. Currently 1.
2600 - The second integer is the minor
2601 version. Currently 0.
2602 "Printf" sequence of Each string is encoded information
2603 strings about a printf function call. The
2604 encoded information is organized as
2605 fields separated by colon (':'):
2607 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
2612 A 32-bit integer as a unique id for
2613 each printf function call
2616 A 32-bit integer equal to the number
2617 of arguments of printf function call
2620 ``S[i]`` (where i = 0, 1, ... , N-1)
2621 32-bit integers for the size in bytes
2622 of the i-th FormatString argument of
2623 the printf function call
2626 The format string passed to the
2627 printf function call.
2628 "Kernels" sequence of Required Sequence of the mappings for each
2629 mapping kernel in the code object. See
2630 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
2631 for the definition of the mapping.
2632 ========== ============== ========= =======================================
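
The colon-separated ``"Printf"`` encoding described above can be decoded
mechanically. The following is a minimal sketch, not the runtime's actual
parser; the function name ``parse_printf_entry`` and the returned dictionary
layout are illustrative assumptions. It relies only on the field order
``ID:N:S[0]:...:S[N-1]:FormatString`` and allows for a format string that
itself contains colons.

.. code-block:: python

   def parse_printf_entry(entry):
       """Split one "Printf" metadata string into its fields (hypothetical helper)."""
       # ID and N are the first two colon-separated fields.
       printf_id, num_args, rest = entry.split(":", 2)
       num_args = int(num_args)
       # The next N fields are the argument sizes. Whatever follows them is the
       # format string, which may contain ':' characters of its own.
       fields = rest.split(":", num_args)
       if len(fields) != num_args + 1:
           raise ValueError("malformed printf entry: %r" % entry)
       sizes = [int(size) for size in fields[:num_args]]
       return {"id": int(printf_id), "arg_sizes": sizes, "format": fields[num_args]}

Limiting the number of splits to ``N`` keeps the format string intact even when
it contains the separator character.
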
2636 .. table:: AMDHSA Code Object V2 Kernel Metadata Map
2637 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table
2639 ================= ============== ========= ================================
2640 String Key Value Type Required? Description
2641 ================= ============== ========= ================================
2642 "Name" string Required Source name of the kernel.
2643 "SymbolName" string Required Name of the kernel
2644 descriptor ELF symbol.
2645 "Language" string Source language of the kernel.
2653 "LanguageVersion" sequence of - The first integer is the major
2655 - The second integer is the
2657 "Attrs" mapping Mapping of kernel attributes.
2659 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
2660 for the mapping definition.
2661 "Args" sequence of Sequence of mappings of the
2662 mapping kernel arguments. See
2663 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
2664 for the definition of the mapping.
2665 "CodeProps" mapping Mapping of properties related to
2666 the kernel code. See
2667 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
2668 for the mapping definition.
2669 ================= ============== ========= ================================
2673 .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
2674 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table
2676 =================== ============== ========= ==============================
2677 String Key Value Type Required? Description
2678 =================== ============== ========= ==============================
2679 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
2680 3 integers must be >=1 and the dispatch
2681 work-group size X, Y, Z must
2682 correspond to the specified
2683 values. Defaults to 0, 0, 0.
2685 Corresponds to the OpenCL
2686 ``reqd_work_group_size``
2688 "WorkGroupSizeHint" sequence of The dispatch work-group size
2689 3 integers X, Y, Z is likely to be the
2692 Corresponds to the OpenCL
2693 ``work_group_size_hint``
2695 "VecTypeHint" string The name of a scalar or vector
2698 Corresponds to the OpenCL
2699 ``vec_type_hint`` attribute.
2701 "RuntimeHandle" string The external symbol name
2702 associated with a kernel.
2703 OpenCL runtime allocates a
2704 global buffer for the symbol
2705 and saves the kernel's address
2706 to it, which is used for
2707 device side enqueueing. Only
2708 available for device side
2710 =================== ============== ========= ==============================
2714 .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
2715 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table
2717 ================= ============== ========= ================================
2718 String Key Value Type Required? Description
2719 ================= ============== ========= ================================
2720 "Name" string Kernel argument name.
2721 "TypeName" string Kernel argument type name.
2722 "Size" integer Required Kernel argument size in bytes.
2723 "Align" integer Required Kernel argument alignment in
2724 bytes. Must be a power of two.
2725 "ValueKind" string Required Kernel argument kind that
2726 specifies how to set up the
2727 corresponding argument.
2731 The argument is copied
2732 directly into the kernarg.
2735 A global address space pointer
2736 to the buffer data is passed
2739 "DynamicSharedPointer"
2740 A group address space pointer
2741 to dynamically allocated LDS
2742 is passed in the kernarg.
2745 A global address space
2746 pointer to a S# is passed in
2750 A global address space
2751 pointer to a T# is passed in
2755 A global address space pointer
2756 to an OpenCL pipe is passed in
2760 A global address space pointer
2761 to an OpenCL device enqueue
2762 queue is passed in the
2765 "HiddenGlobalOffsetX"
2766 The OpenCL grid dispatch
2767 global offset for the X
2768 dimension is passed in the
2771 "HiddenGlobalOffsetY"
2772 The OpenCL grid dispatch
2773 global offset for the Y
2774 dimension is passed in the
2777 "HiddenGlobalOffsetZ"
2778 The OpenCL grid dispatch
2779 global offset for the Z
2780 dimension is passed in the
2784 An argument that is not used
2785 by the kernel. Space needs to
2786 be left for it, but it does
2787 not need to be set up.
2789 "HiddenPrintfBuffer"
2790 A global address space pointer
2791 to the runtime printf buffer
2792 is passed in kernarg.
2794 "HiddenHostcallBuffer"
2795 A global address space pointer
2796 to the runtime hostcall buffer
2797 is passed in kernarg.
2799 "HiddenDefaultQueue"
2800 A global address space pointer
2801 to the OpenCL device enqueue
2802 queue that should be used by
2803 the kernel by default is
2804 passed in the kernarg.
2806 "HiddenCompletionAction"
2807 A global address space pointer
2808 to help link enqueued kernels into
2809 the ancestor tree for determining
2810 when the parent kernel has finished.
2812 "HiddenMultiGridSyncArg"
2813 A global address space pointer for
2814 multi-grid synchronization is
2815 passed in the kernarg.
2817 "ValueType" string Unused and deprecated. This should no longer
2818 be emitted, but is accepted for compatibility.
2821 "PointeeAlign" integer Alignment in bytes of pointee
2822 type for pointer type kernel
2823 argument. Must be a power
2824 of 2. Only present if
2826 "DynamicSharedPointer".
2827 "AddrSpaceQual" string Kernel argument address space
2828 qualifier. Only present if
2829 "ValueKind" is "GlobalBuffer" or
2830 "DynamicSharedPointer". Values
2842 Is GlobalBuffer only Global
2844 DynamicSharedPointer always
2845 Local? Can HCC allow Generic?
2846 How can Private or Region
2849 "AccQual" string Kernel argument access
2850 qualifier. Only present if
2851 "ValueKind" is "Image" or
2864 "ActualAccQual" string The actual memory accesses
2865 performed by the kernel on the
2866 kernel argument. Only present if
2867 "ValueKind" is "GlobalBuffer",
2868 "Image", or "Pipe". This may be
2869 more restrictive than indicated
2870 by "AccQual" to reflect what the
2871 kernel actually does. If not
2872 present then the runtime must
2873 assume what is implied by
2874 "AccQual" and "IsConst". Values
2881 "IsConst" boolean Indicates if the kernel argument
2882 is const qualified. Only present
2886 "IsRestrict" boolean Indicates if the kernel argument
2887 is restrict qualified. Only
2888 present if "ValueKind" is
2891 "IsVolatile" boolean Indicates if the kernel argument
2892 is volatile qualified. Only
2893 present if "ValueKind" is
2896 "IsPipe" boolean Indicates if the kernel argument
2897 is pipe qualified. Only present
2898 if "ValueKind" is "Pipe".
2902 Can GlobalBuffer be pipe
2905 ================= ============== ========= ================================
2909 .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
2910 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table
2912 ============================ ============== ========= =====================
2913 String Key Value Type Required? Description
2914 ============================ ============== ========= =====================
2915 "KernargSegmentSize" integer Required The size in bytes of
2917 that holds the values
2920 "GroupSegmentFixedSize" integer Required The amount of group
2924 bytes. This does not
2926 dynamically allocated
2927 group segment memory
2931 "PrivateSegmentFixedSize" integer Required The amount of fixed
2932 private address space
2933 memory required for a
2935 bytes. If the kernel
2937 stack then additional
2939 to this value for the
2941 "KernargSegmentAlign" integer Required The maximum byte
2944 kernarg segment. Must
2946 "WavefrontSize" integer Required Wavefront size. Must
2948 "NumSGPRs" integer Required Number of scalar
2952 includes the special
2954 Scratch (GFX7-GFX10)
2956 GFX8-GFX10). It does
2958 SGPR added if a trap
2964 "NumVGPRs" integer Required Number of vector
2968 "MaxFlatWorkGroupSize" integer Required Maximum flat
2971 kernel in work-items.
2974 ReqdWorkGroupSize if
2976 "NumSpilledSGPRs" integer Number of stores from
2977 a scalar register to
2978 a register allocator
2981 "NumSpilledVGPRs" integer Number of stores from
2982 a vector register to
2983 a register allocator
2986 ============================ ============== ========= =====================
2988 .. _amdgpu-amdhsa-code-object-metadata-v3:
2990 Code Object V3 Metadata
2991 +++++++++++++++++++++++
2993 Code object V3 to V4 metadata is specified by the ``NT_AMDGPU_METADATA`` note
2994 record (see :ref:`amdgpu-note-records-v3-v4`).
2996 The metadata is represented as Message Pack formatted binary data (see
2997 [MsgPack]_). The top level is a Message Pack map that includes the
2998 keys defined in table
2999 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
3002 Additional information can be added to the maps. To avoid conflicts,
3003 any key names should be prefixed by "*vendor-name*." where
3004 ``vendor-name`` can be the name of the vendor and specific vendor
3005 tool that generates the information. The prefix is abbreviated to
3006 simply "." when it appears within a map that has been added by the
3009 .. table:: AMDHSA Code Object V3 Metadata Map
3010 :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
3012 ================= ============== ========= =======================================
3013 String Key Value Type Required? Description
3014 ================= ============== ========= =======================================
3015 "amdhsa.version" sequence of Required - The first integer is the major
3016 2 integers version. Currently 1.
3017 - The second integer is the minor
3018 version. Currently 0.
3019 "amdhsa.printf" sequence of Each string is encoded information
3020 strings about a printf function call. The
3021 encoded information is organized as
3022 fields separated by colon (':'):
3024 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
3029 A 32-bit integer as a unique id for
3030 each printf function call
3033 A 32-bit integer equal to the number
3034 of arguments of printf function call
3037 ``S[i]`` (where i = 0, 1, ... , N-1)
3038 32-bit integers for the size in bytes
3039 of the i-th FormatString argument of
3040 the printf function call
3043 The format string passed to the
3044 printf function call.
3045 "amdhsa.kernels" sequence of Required Sequence of the maps for each
3046 map kernel in the code object. See
3047 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
3048 for the definition of the keys included
3050 ================= ============== ========= =======================================
3054 .. table:: AMDHSA Code Object V3 Kernel Metadata Map
3055 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
3057 =================================== ============== ========= ================================
3058 String Key Value Type Required? Description
3059 =================================== ============== ========= ================================
3060 ".name" string Required Source name of the kernel.
3061 ".symbol" string Required Name of the kernel
3062 descriptor ELF symbol.
3063 ".language" string Source language of the kernel.
3073 ".language_version" sequence of - The first integer is the major
3075 - The second integer is the
3077 ".args" sequence of Sequence of maps of the
3078 map kernel arguments. See
3079 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
3080 for the definition of the keys
3081 included in that map.
3082 ".reqd_workgroup_size" sequence of If not 0, 0, 0 then all values
3083 3 integers must be >=1 and the dispatch
3084 work-group size X, Y, Z must
3085 correspond to the specified
3086 values. Defaults to 0, 0, 0.
3088 Corresponds to the OpenCL
3089 ``reqd_work_group_size``
3091 ".workgroup_size_hint" sequence of The dispatch work-group size
3092 3 integers X, Y, Z is likely to be the
3095 Corresponds to the OpenCL
3096 ``work_group_size_hint``
3098 ".vec_type_hint" string The name of a scalar or vector
3101 Corresponds to the OpenCL
3102 ``vec_type_hint`` attribute.
3104 ".device_enqueue_symbol" string The external symbol name
3105 associated with a kernel.
3106 OpenCL runtime allocates a
3107 global buffer for the symbol
3108 and saves the kernel's address
3109 to it, which is used for
3110 device side enqueueing. Only
3111 available for device side
3113 ".kernarg_segment_size" integer Required The size in bytes of
3115 that holds the values
3118 ".group_segment_fixed_size" integer Required The amount of group
3122 bytes. This does not
3124 dynamically allocated
3125 group segment memory
3129 ".private_segment_fixed_size" integer Required The amount of fixed
3130 private address space
3131 memory required for a
3133 bytes. If the kernel
3135 stack then additional
3137 to this value for the
3139 ".kernarg_segment_align" integer Required The maximum byte
3142 kernarg segment. Must
3144 ".wavefront_size" integer Required Wavefront size. Must
3146 ".sgpr_count" integer Required Number of scalar
3147 registers required by a
3149 GFX6-GFX9. A register
3150 is required if it is
3152 if a higher numbered
3155 includes the special
3161 SGPR added if a trap
3167 ".vgpr_count" integer Required Number of vector
3168 registers required by
3170 GFX6-GFX9. A register
3171 is required if it is
3173 if a higher numbered
3176 ".max_flat_workgroup_size" integer Required Maximum flat
3179 kernel in work-items.
3182 ".reqd_workgroup_size" if
3184 ".sgpr_spill_count" integer Number of stores from
3185 a scalar register to
3186 a register allocator
3189 ".vgpr_spill_count" integer Number of stores from
3190 a vector register to
3191 a register allocator
3194 ".kind" string The kind of the kernel
3202 These kernels must be
3203 invoked after loading
3213 These kernels must be
3216 containing code object
3217 and after all init and
3218 normal kernels in the
3219 same code object have
3223 If omitted, "normal" is
3225 =================================== ============== ========= ================================
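
A consumer of this map might first check that every key marked *Required* in
the table above is present. The sketch below assumes the Message Pack data has
already been decoded into a Python ``dict``; the helper names and the sample
kernel values are hypothetical and only illustrate the shape of the map.

.. code-block:: python

   # Keys marked "Required" in the V3 kernel metadata map table.
   REQUIRED_KERNEL_KEYS = (
       ".name",
       ".symbol",
       ".kernarg_segment_size",
       ".group_segment_fixed_size",
       ".private_segment_fixed_size",
       ".kernarg_segment_align",
       ".wavefront_size",
       ".sgpr_count",
       ".vgpr_count",
       ".max_flat_workgroup_size",
   )

   def missing_kernel_keys(kernel):
       """Return the required keys that are absent from a decoded kernel map."""
       return [key for key in REQUIRED_KERNEL_KEYS if key not in kernel]

   def kernel_kind(kernel):
       """Return the kernel kind, falling back to "normal" when ".kind" is omitted."""
       return kernel.get(".kind", "normal")

   # Hypothetical decoded metadata for a single kernel.
   example_kernel = {
       ".name": "vector_add",
       ".symbol": "vector_add.kd",
       ".kernarg_segment_size": 24,
       ".group_segment_fixed_size": 0,
       ".private_segment_fixed_size": 0,
       ".kernarg_segment_align": 8,
       ".wavefront_size": 64,
       ".sgpr_count": 16,
       ".vgpr_count": 8,
       ".max_flat_workgroup_size": 256,
   }
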
3229 .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
3230 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
3232 ====================== ============== ========= ================================
3233 String Key Value Type Required? Description
3234 ====================== ============== ========= ================================
3235 ".name" string Kernel argument name.
3236 ".type_name" string Kernel argument type name.
3237 ".size" integer Required Kernel argument size in bytes.
3238 ".offset" integer Required Kernel argument offset in
3239 bytes. The offset must be a
3240 multiple of the alignment
3241 required by the argument.
3242 ".value_kind" string Required Kernel argument kind that
3243 specifies how to set up the
3244 corresponding argument.
3248 The argument is copied
3249 directly into the kernarg.
3252 A global address space pointer
3253 to the buffer data is passed
3256 "dynamic_shared_pointer"
3257 A group address space pointer
3258 to dynamically allocated LDS
3259 is passed in the kernarg.
3262 A global address space
3263 pointer to a S# is passed in
3267 A global address space
3268 pointer to a T# is passed in
3272 A global address space pointer
3273 to an OpenCL pipe is passed in
3277 A global address space pointer
3278 to an OpenCL device enqueue
3279 queue is passed in the
3282 "hidden_global_offset_x"
3283 The OpenCL grid dispatch
3284 global offset for the X
3285 dimension is passed in the
3288 "hidden_global_offset_y"
3289 The OpenCL grid dispatch
3290 global offset for the Y
3291 dimension is passed in the
3294 "hidden_global_offset_z"
3295 The OpenCL grid dispatch
3296 global offset for the Z
3297 dimension is passed in the
3301 An argument that is not used
3302 by the kernel. Space needs to
3303 be left for it, but it does
3304 not need to be set up.
3306 "hidden_printf_buffer"
3307 A global address space pointer
3308 to the runtime printf buffer
3309 is passed in kernarg.
3311 "hidden_hostcall_buffer"
3312 A global address space pointer
3313 to the runtime hostcall buffer
3314 is passed in kernarg.
3316 "hidden_default_queue"
3317 A global address space pointer
3318 to the OpenCL device enqueue
3319 queue that should be used by
3320 the kernel by default is
3321 passed in the kernarg.
3323 "hidden_completion_action"
3324 A global address space pointer
3325 to help link enqueued kernels into
3326 the ancestor tree for determining
3327 when the parent kernel has finished.
3329 "hidden_multigrid_sync_arg"
3330 A global address space pointer for
3331 multi-grid synchronization is
3332 passed in the kernarg.
3334 ".value_type" string Unused and deprecated. This should no longer
3335 be emitted, but is accepted for compatibility.
3337 ".pointee_align" integer Alignment in bytes of pointee
3338 type for pointer type kernel
3339 argument. Must be a power
3340 of 2. Only present if
3342 "dynamic_shared_pointer".
3343 ".address_space" string Kernel argument address space
3344 qualifier. Only present if
3345 ".value_kind" is "global_buffer" or
3346 "dynamic_shared_pointer". Values
3358 Is "global_buffer" only "global"
3360 "dynamic_shared_pointer" always
3361 "local"? Can HCC allow "generic"?
3362 How can "private" or "region"
3365 ".access" string Kernel argument access
3366 qualifier. Only present if
3367 ".value_kind" is "image" or
3380 ".actual_access" string The actual memory accesses
3381 performed by the kernel on the
3382 kernel argument. Only present if
3383 ".value_kind" is "global_buffer",
3384 "image", or "pipe". This may be
3385 more restrictive than indicated
3386 by ".access" to reflect what the
3387 kernel actually does. If not
3388 present then the runtime must
3389 assume what is implied by
3390 ".access" and ".is_const". Values
3397 ".is_const" boolean Indicates if the kernel argument
3398 is const qualified. Only present
3402 ".is_restrict" boolean Indicates if the kernel argument
3403 is restrict qualified. Only
3404 present if ".value_kind" is
3407 ".is_volatile" boolean Indicates if the kernel argument
3408 is volatile qualified. Only
3409 present if ".value_kind" is
3412 ".is_pipe" boolean Indicates if the kernel argument
3413 is pipe qualified. Only present
3414 if ".value_kind" is "pipe".
3418 Can "global_buffer" be pipe
3421 ====================== ============== ========= ================================
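
The ``.offset`` rule above (each offset must be a multiple of the alignment the
argument requires) can be satisfied by rounding the running offset up before
placing each argument. This is a sketch of one natural layout under that rule,
not necessarily the layout the compiler emits; ``align_up`` and
``layout_kernargs`` are hypothetical helper names.

.. code-block:: python

   def align_up(value, alignment):
       """Round value up to the next multiple of a power-of-two alignment."""
       if alignment <= 0 or alignment & (alignment - 1):
           raise ValueError("alignment must be a power of two")
       return (value + alignment - 1) & ~(alignment - 1)

   def layout_kernargs(args):
       """Assign offsets to (size, alignment) pairs in declaration order.

       Returns the per-argument maps and the total number of bytes used.
       """
       offset = 0
       placed = []
       for size, alignment in args:
           offset = align_up(offset, alignment)
           placed.append({".size": size, ".offset": offset})
           offset += size
       return placed, offset

For example, a 4-byte argument followed by an 8-byte pointer leaves four bytes
of padding so that the pointer starts at offset 8.
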
3423 .. _amdgpu-amdhsa-code-object-metadata-v4:
3425 Code Object V4 Metadata
3426 +++++++++++++++++++++++
3429 Code object V4 is not the default code object version emitted by this version
3432 Code object V4 metadata is the same as
3433 :ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
3434 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.
3436 .. table:: AMDHSA Code Object V4 Metadata Map Changes from :ref:`amdgpu-amdhsa-code-object-metadata-v3`
3437 :name: amdgpu-amdhsa-code-object-metadata-map-table-v4
3439 ================= ============== ========= =======================================
3440 String Key Value Type Required? Description
3441 ================= ============== ========= =======================================
3442 "amdhsa.version" sequence of Required - The first integer is the major
3443 2 integers version. Currently 1.
3444 - The second integer is the minor
3445 version. Currently 1.
3446 "amdhsa.target" string Required The target name of the code using the syntax:
3450 <target-triple> [ "-" <target-id> ]
3452 A canonical target ID must be
3453 used. See :ref:`amdgpu-target-triples`
3454 and :ref:`amdgpu-target-id`.
3455 ================= ============== ========= =======================================
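
The ``"amdhsa.target"`` syntax can be taken apart by splitting off the four
triple components first, since a target ID may itself contain ``-`` (for
example in a disabled feature such as ``xnack-``). The sketch below assumes a
four-component triple of the form ``<Architecture>-<Vendor>-<OS>-<Environment>``
and a target ID written as a processor name followed by ``:feature+`` or
``:feature-`` settings; the function name and return shape are illustrative.

.. code-block:: python

   def parse_amdhsa_target(target):
       """Split an "amdhsa.target" string into (triple, processor, features)."""
       # At most four splits: arch, vendor, OS, environment, then the target ID.
       parts = target.split("-", 4)
       if len(parts) < 4:
           raise ValueError("expected a four-component target triple: %r" % target)
       triple = "-".join(parts[:4])
       processor, features = None, {}
       if len(parts) == 5:
           processor, *settings = parts[4].split(":")
           for setting in settings:
               if len(setting) < 2 or setting[-1] not in "+-":
                   raise ValueError("malformed feature setting: %r" % setting)
               # "name+" means the feature is on, "name-" means it is off.
               features[setting[:-1]] = setting[-1] == "+"
       return triple, processor, features
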
3462 The HSA architected queuing language (AQL) defines a user space memory interface
3463 that can be used to control the dispatch of kernels, in an agent independent
3464 way. An agent can have zero or more AQL queues created for it using an HSA
3465 compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
3466 are 64 bytes) can be placed. See the *HSA Platform System Architecture
3467 Specification* [HSA]_ for the AQL queue mechanics and packet layouts.
3469 The packet processor of a kernel agent is responsible for detecting and
3470 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
3471 packet processor is implemented by the hardware command processor (CP),
3472 asynchronous dispatch controller (ADC) and shader processor input controller
3475 An HSA compatible runtime can be used to allocate an AQL queue object. It uses
3476 the kernel mode driver to initialize and register the AQL queue with CP.
3478 To dispatch a kernel the following actions are performed. This can occur in the
3479 CPU host program, or from an HSA kernel executing on a GPU.
3481 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
3482 executed is obtained.
3483 2. A pointer to the kernel descriptor (see
3484 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
3485 It must be for a kernel that is contained in a code object that was
3486 loaded by an HSA compatible runtime on the kernel agent with which the AQL
3487 queue is associated.
3488 3. Space is allocated for the kernel arguments using the HSA compatible runtime
3489 allocator for a memory region with the kernarg property for the kernel agent
3490 that will execute the kernel. It must be at least 16-byte aligned.
3491 4. Kernel argument values are assigned to the kernel argument memory
3492 allocation. The layout is defined in the *HSA Programmer's Language
3493 Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
3494 kernel argument memory in the same way constant memory is accessed. (Note
3495 that the HSA specification allows an implementation to copy the kernel
3496 argument contents to another location that is accessed by the kernel.)
3497 5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
3498 runtime API uses 64-bit atomic operations to reserve space in the AQL queue
3499 for the packet. The packet must be set up, and the final write must use an
3500 atomic store release to set the packet kind to ensure the packet contents are
3501 visible to the kernel agent. AQL defines a doorbell signal mechanism to
3502 notify the kernel agent that the AQL queue has been updated. These rules, and
3503 the layout of the AQL queue and kernel dispatch packet, are defined in the *HSA
3504 System Architecture Specification* [HSA]_.
3505 6. A kernel dispatch packet includes information about the actual dispatch,
3506 such as grid and work-group size, together with information from the code
3507 object about the kernel, such as segment sizes. The HSA compatible runtime
3508 queries on the kernel symbol can be used to obtain the code object values
3509 which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
3510 7. CP executes micro-code and is responsible for detecting and setting up the
3511 GPU to execute the wavefronts of a kernel dispatch.
3512 8. CP ensures that when a wavefront starts executing the kernel machine
3513 code, the scalar general purpose registers (SGPR) and vector general purpose
3514 registers (VGPR) are set up as required by the machine code. The required
3515 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
3516 register state is defined in
3517 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
3518 9. The prolog of the kernel machine code (see
3519 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
3520 before continuing to execute the machine code that corresponds to the kernel.
3521 10. When the kernel dispatch has completed execution, CP signals the completion
3522 signal specified in the kernel dispatch packet if not 0.
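
The ordering requirement in step 5 (fill in the packet body first, publish the
packet kind last, then ring the doorbell) can be illustrated with a toy,
single-threaded model. This is only a sketch: Python has no atomic store
release, so the ordering is shown by statement order alone. The field order and
the packet type values (1 for invalid, 2 for kernel dispatch) are assumptions
taken from the public HSA runtime headers rather than from this document, and
``ToyQueue`` is a hypothetical class that ignores queue-full handling and the
fence-scope bits of the header.

.. code-block:: python

   import struct

   PACKET_SIZE = 64                 # all AQL packets are 64 bytes
   PACKET_TYPE_INVALID = 1          # assumed value from the HSA runtime headers
   PACKET_TYPE_KERNEL_DISPATCH = 2  # assumed value from the HSA runtime headers

   # Everything after the 16-bit header: setup, work-group size x/y/z, reserved,
   # grid size x/y/z, private and group segment sizes, kernel object address,
   # kernarg address, reserved, completion signal (62 bytes in total).
   _BODY = struct.Struct("<HHHHHIIIIIQQQQ")

   class ToyQueue:
       def __init__(self, num_packets):
           self.size = num_packets
           self.ring = bytearray(num_packets * PACKET_SIZE)
           for slot in range(num_packets):
               struct.pack_into("<H", self.ring, slot * PACKET_SIZE, PACKET_TYPE_INVALID)
           self.write_index = 0
           self.doorbell = None

       def dispatch(self, kernel_object, kernarg_address, grid, workgroup,
                    private_size=0, group_size=0, completion_signal=0):
           # A real runtime reserves the slot with a 64-bit atomic add.
           index = self.write_index
           self.write_index += 1
           base = (index % self.size) * PACKET_SIZE
           # Write every field except the header. The setup field holds the
           # number of dimensions (3 here).
           _BODY.pack_into(self.ring, base + 2, 3, *workgroup, 0, *grid,
                           private_size, group_size,
                           kernel_object, kernarg_address, 0, completion_signal)
           # Publish the packet kind last; hardware queues need a store release here.
           struct.pack_into("<H", self.ring, base, PACKET_TYPE_KERNEL_DISPATCH)
           # Notify the packet processor that the queue has been updated.
           self.doorbell = index
           return index
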
3524 .. _amdgpu-amdhsa-memory-spaces:
3529 The memory space properties are:
3531 .. table:: AMDHSA Memory Spaces
3532 :name: amdgpu-amdhsa-memory-spaces-table
3534 ================= =========== ======== ======= ==================
3535 Memory Space Name HSA Segment Hardware Address NULL Value
3537 ================= =========== ======== ======= ==================
3538 Private private scratch 32 0x00000000
3539 Local group LDS 32 0xFFFFFFFF
3540 Global global global 64 0x0000000000000000
3541 Constant constant *same as 64 0x0000000000000000
3543 Generic flat flat 64 0x0000000000000000
3544 Region N/A GDS 32 *not implemented
3546 ================= =========== ======== ======= ==================
3548 The global and constant memory spaces both use global virtual addresses, which
3549 are the same virtual address space used by the CPU. However, some virtual
3550 addresses may only be accessible to the CPU, some only accessible by the GPU,
3553 Using the constant memory space indicates that the data will not change during
3554 the execution of the kernel. This allows scalar read instructions to be
3555 used. The vector and scalar L1 caches are invalidated of volatile data before
3556 each kernel dispatch execution to allow constant memory to change values between
3559 The local memory space uses the hardware Local Data Store (LDS) which is
3560 automatically allocated when the hardware creates work-groups of wavefronts, and
3561 freed when all the wavefronts of a work-group have terminated. The data store
3562 (DS) instructions can be used to access it.
3564 The private memory space uses the hardware scratch memory support. If the kernel
3565 uses scratch, then the hardware allocates memory that is accessed using
3566 wavefront lane dword (4 byte) interleaving. The mapping used from private
3567 address to physical address is:
3569 ``wavefront-scratch-base +
3570 (private-address * wavefront-size * 4) +
3571 (wavefront-lane-id * 4)``
3573 There are different ways that the wavefront scratch base address is determined
3574 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
3575 memory can be accessed in an interleaved manner using buffer instructions with
3576 the scratch buffer descriptor and per wavefront scratch offset, by the scratch
3577 instructions, or by flat instructions. If each lane of a wavefront accesses the
3578 same private address, the interleaving results in adjacent dwords being accessed
3579 and hence requires fewer cache lines to be fetched. Multi-dword access is not
3580 supported except by flat and scratch instructions in GFX9-GFX10.
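
The address mapping given above can be transcribed literally to see the
interleaving effect. This sketch simply evaluates the stated expression; the
function name, the default wavefront size of 64, and the sample base address
are illustrative assumptions.

.. code-block:: python

   def scratch_physical_address(wavefront_scratch_base, private_address,
                                wavefront_lane_id, wavefront_size=64):
       """Evaluate the private-to-physical mapping exactly as written above."""
       return (wavefront_scratch_base
               + (private_address * wavefront_size * 4)
               + (wavefront_lane_id * 4))

   # Every lane of one wavefront reads the same private address.
   base = 0x10000  # hypothetical wavefront scratch base
   addresses = [scratch_physical_address(base, 3, lane) for lane in range(64)]

When all lanes use the same private address, consecutive lanes land on
consecutive dwords, which is why such accesses touch fewer cache lines.
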
3582 The generic address space uses the hardware flat address support available in
3583 GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
3584 local apertures), that are outside the range of addressable global memory, to
3585 map from a flat address to a private or local address.
3587 FLAT instructions can take a flat address and access global, private (scratch)
3588 and group (LDS) memory depending on whether the address is within one of the
3589 aperture ranges. Flat access to scratch requires hardware aperture setup and
3590 setup in the kernel prologue (see
3591 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
3592 hardware aperture setup and M0 (GFX7-GFX8) register setup (see
3593 :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
3595 To convert between a segment address and a flat address the base addresses of
3596 the apertures can be used. For GFX7-GFX8 these are available in the
3597 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
3598 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
3599 GFX9-GFX10 the aperture base addresses are directly available as inline constant
3600 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit
3601 address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
3602 which makes it easier to convert from flat to segment or segment to flat.
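
Because each aperture is 2^32 bytes and its base is aligned to 2^32, the
conversion reduces to bit operations on the upper and lower 32 bits. The sketch
below illustrates this for 64-bit address mode. The helper names and the sample
aperture base are hypothetical, and NULL pointer translation (the local NULL
value differs from the generic one) is deliberately left out.

.. code-block:: python

   APERTURE_SIZE = 1 << 32

   def segment_to_flat(aperture_base, segment_address):
       """Form a flat address from a 32-bit private or local address."""
       if aperture_base % APERTURE_SIZE:
           raise ValueError("aperture base must be aligned to 2^32")
       if not 0 <= segment_address < APERTURE_SIZE:
           raise ValueError("segment address must fit in 32 bits")
       # The base has zero low bits, so OR is equivalent to addition.
       return aperture_base | segment_address

   def is_in_aperture(aperture_base, flat_address):
       """A flat address is in the aperture when the upper 32 bits match the base."""
       return (flat_address >> 32) == (aperture_base >> 32)

   def flat_to_segment(aperture_base, flat_address):
       """Recover the 32-bit segment address from a flat address in the aperture."""
       if not is_in_aperture(aperture_base, flat_address):
           raise ValueError("flat address is outside the aperture")
       return flat_address & (APERTURE_SIZE - 1)
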
3607 Image and sample handles created by an HSA compatible runtime (see
3608 :ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
3609 object respectively. In order to support the HSA ``query_sampler`` operations
3610 two extra dwords are used to store the HSA BRIG enumeration values for the
3611 queries that are not trivially deducible from the S# representation.
3616 HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
3617 are 64-bit addresses of a structure allocated in memory accessible from both the
3618 CPU and GPU. The structure is defined by the runtime and subject to change
3619 between releases. For example, see [AMD-ROCm-github]_.
3621 .. _amdgpu-amdhsa-hsa-aql-queue:
3626 The HSA AQL queue structure is defined by an HSA compatible runtime (see
3627 :ref:`amdgpu-os`) and subject to change between releases. For example, see
3628 [AMD-ROCm-github]_. For some processors it contains fields needed to implement
3629 certain language features such as the flat address aperture bases. It also
3630 contains fields used by CP such as managing the allocation of scratch memory.
3632 .. _amdgpu-amdhsa-kernel-descriptor:
3637 A kernel descriptor consists of the information needed by CP to initiate the
3638 execution of a kernel, including the entry point address of the machine code
3639 that implements the kernel.
3641 Code Object V3 Kernel Descriptor
3642 ++++++++++++++++++++++++++++++++
3644 CP microcode requires the kernel descriptor to be allocated on 64-byte
3647 The fields used by CP for code objects before V3 also match those specified in
3648 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
3650 .. table:: Code Object V3 Kernel Descriptor
3651 :name: amdgpu-amdhsa-kernel-descriptor-v3-table
3653 ======= ======= =============================== ============================
3654 Bits Size Field Name Description
3655 ======= ======= =============================== ============================
3656 31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local
3657 address space memory
3658 required for a work-group
3659 in bytes. This does not
3660 include any dynamically
3661 allocated local address
3662 space memory that may be
3663 added when the kernel is
3665 63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed
3666 private address space
3667 memory required for a
3669 Additional space may need to
3670 be added to this value if
3672 non-inlined function calls.
3673 95:64 4 bytes KERNARG_SIZE The size of the kernarg
3674 memory pointed to by the
3675 AQL dispatch packet. The
3676 kernarg memory is used to
3677 pass arguments to the
3680 * If the kernarg pointer in
3681 the dispatch packet is NULL
3682 then there are no kernel
3684 * If the kernarg pointer in
3685 the dispatch packet is
3686 not NULL and this value
3687 is 0 then the kernarg
3690 * If the kernarg pointer in
3691 the dispatch packet is
3692 not NULL and this value
3693 is not 0 then the value
3694 specifies the kernarg
3695 memory size in bytes. It
3696 is recommended to provide
3697 a value as it may be used
3698 by CP to optimize making
3700 visible to the kernel
3703 127:96 4 bytes Reserved, must be 0.
3704 191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly
3707 descriptor to kernel's
3708 entry point instruction
3709 which must be 256 byte
3711 351:192 20 bytes Reserved, must be 0.
3713 383:352 4 bytes COMPUTE_PGM_RSRC3 GFX6-GFX9
3714 Reserved, must be 0.
3717 program settings used by
3719 ``COMPUTE_PGM_RSRC3``
3722 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
3725 program settings used by
3727 ``COMPUTE_PGM_RSRC3``
3730 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
3731 415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS)
3732 program settings used by
3734 ``COMPUTE_PGM_RSRC1``
3737 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
3738 447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS)
3739 program settings used by
3741 ``COMPUTE_PGM_RSRC2``
3744 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
3745 454:448 7 bits *See separate bits below.* Enable the setup of the
3746 SGPR user data registers
3748 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
3750 The total number of SGPR
3752 requested must not exceed
3753 16 and must match the value in
3754 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
3755 Any requests beyond 16
3757 >448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT If the *Target Properties*
3759 :ref:`amdgpu-processor-table`
3760 specifies *Architected flat
3761 scratch* then not supported
3763 >449 1 bit ENABLE_SGPR_DISPATCH_PTR
3764 >450 1 bit ENABLE_SGPR_QUEUE_PTR
3765 >451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR
3766 >452 1 bit ENABLE_SGPR_DISPATCH_ID
3767 >453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT If the *Target Properties*
3769 :ref:`amdgpu-processor-table`
3770 specifies *Architected flat
3771 scratch* then not supported
3773 >454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT
3775 457:455 3 bits Reserved, must be 0.
3776 458 1 bit ENABLE_WAVEFRONT_SIZE32 GFX6-GFX9
3777 Reserved, must be 0.
3780 wavefront size 64 mode.
3782 native wavefront size
3784 463:459 5 bits Reserved, must be 0.
3785 464 1 bit RESERVED_464 Deprecated, must be 0.
3786 467:465 3 bits Reserved, must be 0.
3787 468 1 bit RESERVED_468 Deprecated, must be 0.
3788 471:469 3 bits Reserved, must be 0.
3789 511:472 5 bytes Reserved, must be 0.
3790 512 **Total size 64 bytes.**
3791 ======= ====================================================================
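The byte layout in the table above can be sketched with Python's ``struct`` module. This is a minimal illustration, not the code LLVM uses: the helper names are hypothetical, little-endian byte order is assumed, and the 16 bits starting at bit 448 are treated as one opaque ``properties`` value.

```python
import struct

# Layout from the Code Object V3 Kernel Descriptor table (little-endian assumed):
# three u32 sizes, 4 reserved bytes, a signed 64-bit entry offset, 20 reserved
# bytes, RSRC3/RSRC1/RSRC2 (u32 each), 16 bits of enable flags, 6 reserved bytes.
KD_FORMAT = "<3I4xq20x3IH6x"
KD_SIZE = struct.calcsize(KD_FORMAT)

KD_FIELDS = (
    "group_segment_fixed_size",
    "private_segment_fixed_size",
    "kernarg_size",
    "kernel_code_entry_byte_offset",
    "compute_pgm_rsrc3",
    "compute_pgm_rsrc1",
    "compute_pgm_rsrc2",
    "properties",  # bits 463:448, hypothetical name for the flag word
)


def pack_kernel_descriptor(*values):
    """Pack the eight field values (in KD_FIELDS order) into 64 bytes."""
    return struct.pack(KD_FORMAT, *values)


def unpack_kernel_descriptor(blob):
    """Return a dict of field name to value; reserved bytes are skipped."""
    return dict(zip(KD_FIELDS, struct.unpack(KD_FORMAT, blob)))
```

The entry offset is packed as a signed value because the table describes it as possibly negative.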
3795 .. table:: compute_pgm_rsrc1 for GFX6-GFX10
3796 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table
3798 ======= ======= =============================== ===========================================================================
3799 Bits Size Field Name Description
3800 ======= ======= =============================== ===========================================================================
3801 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register
3802 blocks used by each work-item;
3803 granularity is device
3808 - max(0, ceil(vgprs_used / 4) - 1)
3811 - vgprs_used = align(arch_vgprs, 4)
3813 - max(0, ceil(vgprs_used / 8) - 1)
3814 GFX10 (wavefront size 64)
3816 - max(0, ceil(vgprs_used / 4) - 1)
3817 GFX10 (wavefront size 32)
3819 - max(0, ceil(vgprs_used / 8) - 1)
3821 Where vgprs_used is defined
3822 as the highest VGPR number
3823 explicitly referenced plus
3826 Used by CP to set up
3827 ``COMPUTE_PGM_RSRC1.VGPRS``.
3830 :ref:`amdgpu-assembler`
3832 automatically for the
3833 selected processor from
3834 values provided to the
3835 `.amdhsa_kernel` directive
3837 `.amdhsa_next_free_vgpr`
3838 nested directive (see
3839 :ref:`amdhsa-kernel-directives-table`).
3840 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
3841 blocks used by a wavefront;
3842 granularity is device
3847 - max(0, ceil(sgprs_used / 8) - 1)
3850 - 2 * max(0, ceil(sgprs_used / 16) - 1)
3852 Reserved, must be 0.
3857 defined as the highest
3858 SGPR number explicitly
3859 referenced plus one, plus
3860 a target specific number
3861 of additional special
3863 FLAT_SCRATCH (GFX7+) and
3864 XNACK_MASK (GFX8+), and
3867 limitations. It does not
3868 include the 16 SGPRs added
3869 if a trap handler is
3873 limitations and special
3874 SGPR layout are defined in
3876 documentation, which can
3878 :ref:`amdgpu-processors`
3881 Used by CP to set up
3882 ``COMPUTE_PGM_RSRC1.SGPRS``.
3885 :ref:`amdgpu-assembler`
3887 automatically for the
3888 selected processor from
3889 values provided to the
3890 `.amdhsa_kernel` directive
3892 `.amdhsa_next_free_sgpr`
3893 and `.amdhsa_reserve_*`
3894 nested directives (see
3895 :ref:`amdhsa-kernel-directives-table`).
3896 11:10 2 bits PRIORITY Must be 0.
3898 Start executing wavefront
3899 at the specified priority.
3901 CP is responsible for
3903 ``COMPUTE_PGM_RSRC1.PRIORITY``.
3904 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
3905 with specified rounding mode for single (32-bit)
3908 precision floating point operations.
3911 Floating point rounding
3912 mode values are defined in
3913 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
3915 Used by CP to set up
3916 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3917 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
3918 with specified rounding
3919 mode for half/double (16
3920 and 64-bit) precision
3921 floating point operations.
3924 Floating point rounding
3925 mode values are defined in
3926 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
3928 Used by CP to set up
3929 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3930 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
3931 with specified denorm mode for single (32-bit)
3934 precision floating point operations.
3937 Floating point denorm mode
3938 values are defined in
3939 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
3941 Used by CP to set up
3942 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3943 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
3944 with specified denorm mode for half/double (16
3946 and 64-bit) precision
3947 floating point operations.
3950 Floating point denorm mode
3951 values are defined in
3952 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
3954 Used by CP to set up
3955 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3956 20 1 bit PRIV Must be 0.
3958 Start executing wavefront
3959 in privilege trap handler
3962 CP is responsible for
3964 ``COMPUTE_PGM_RSRC1.PRIV``.
3965 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
3966 with DX10 clamp mode
3967 enabled. Used by the vector
3968 ALU to force DX10 style
3969 treatment of NaNs (when
3970 set, clamp NaN to zero,
3974 Used by CP to set up
3975 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
3976 22 1 bit DEBUG_MODE Must be 0.
3978 Start executing wavefront
3979 in single step mode.
3981 CP is responsible for
3983 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
3984 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
3986 enabled. Floating point
3987 opcodes that support
3988 exception flag gathering
3989 will quiet and propagate
3990 signaling-NaN inputs per
3991 IEEE 754-2008. Min_dx10 and
3992 max_dx10 become IEEE
3993 754-2008 compliant due to
3994 signaling-NaN propagation
3997 Used by CP to set up
3998 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
3999 24 1 bit BULKY Must be 0.
4001 Only one work-group allowed
4002 to execute on a compute
4005 CP is responsible for
4007 ``COMPUTE_PGM_RSRC1.BULKY``.
4008 25 1 bit CDBG_USER Must be 0.
4010 Flag that can be used to
4011 control debugging code.
4013 CP is responsible for
4015 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
4016 26 1 bit FP16_OVFL GFX6-GFX8
4017 Reserved, must be 0.
4019 Wavefront starts execution
4020 with specified fp16 overflow
4023 - If 0, fp16 overflow generates
4025 - If 1, fp16 overflow that is the
4026 result of an +/-INF input value
4027 or divide by 0 produces a +/-INF,
4028 otherwise clamps computed
4029 overflow to +/-MAX_FP16 as
4032 Used by CP to set up
4033 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
4034 28:27 2 bits Reserved, must be 0.
4035 29 1 bit WGP_MODE GFX6-GFX9
4036 Reserved, must be 0.
4038 - If 0 execute work-groups in
4039 CU wavefront execution mode.
4040 - If 1 execute work-groups
4041 in WGP wavefront execution mode.
4043 See :ref:`amdgpu-amdhsa-memory-model`.
4045 Used by CP to set up
4046 ``COMPUTE_PGM_RSRC1.WGP_MODE``.
4047 30 1 bit MEM_ORDERED GFX6-GFX9
4048 Reserved, must be 0.
4050 Controls the behavior of the
4051 s_waitcnt's vmcnt and vscnt
4054 - If 0 vmcnt reports completion
4055 of load and atomic with return
4056 out of order with sample
4057 instructions, and the vscnt
4058 reports the completion of
4059 store and atomic without
4061 - If 1 vmcnt reports completion
4062 of load, atomic with return
4063 and sample instructions in
4064 order, and the vscnt reports
4065 the completion of store and
4066 atomic without return in order.
4068 Used by CP to set up
4069 ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
4070 31 1 bit FWD_PROGRESS GFX6-GFX9
4071 Reserved, must be 0.
4073 - If 0 execute SIMD wavefronts
4074 using oldest first policy.
4075 - If 1 execute SIMD wavefronts to
4076 ensure wavefronts will make some
4079 Used by CP to set up
4080 ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
4081 32 **Total size 4 bytes.**
4082 ======= ===================================================================================================================
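As an illustration of the table above, the following sketch encodes and decodes a ``compute_pgm_rsrc1`` word and evaluates the granulated register count formula. The function names are hypothetical, and the granule is passed in explicitly rather than derived from a processor name. The table lists a granule of 4 for GFX10 wavefront size 64 and 8 for wavefront size 32.

```python
import math

# (low bit, width) of each compute_pgm_rsrc1 field, copied from the table.
RSRC1_FIELDS = {
    "GRANULATED_WORKITEM_VGPR_COUNT": (0, 6),
    "GRANULATED_WAVEFRONT_SGPR_COUNT": (6, 4),
    "PRIORITY": (10, 2),
    "FLOAT_ROUND_MODE_32": (12, 2),
    "FLOAT_ROUND_MODE_16_64": (14, 2),
    "FLOAT_DENORM_MODE_32": (16, 2),
    "FLOAT_DENORM_MODE_16_64": (18, 2),
    "PRIV": (20, 1),
    "ENABLE_DX10_CLAMP": (21, 1),
    "DEBUG_MODE": (22, 1),
    "ENABLE_IEEE_MODE": (23, 1),
    "BULKY": (24, 1),
    "CDBG_USER": (25, 1),
    "FP16_OVFL": (26, 1),
    "WGP_MODE": (29, 1),
    "MEM_ORDERED": (30, 1),
    "FWD_PROGRESS": (31, 1),
}


def granulated_count(used, granule):
    """max(0, ceil(used / granule) - 1), as in the table's formulas."""
    return max(0, math.ceil(used / granule) - 1)


def encode_rsrc1(**fields):
    """Build a 32-bit word from field values; unnamed fields stay 0."""
    word = 0
    for name, value in fields.items():
        low, width = RSRC1_FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << low
    return word


def decode_rsrc1(word):
    """Split a 32-bit word into a dict of field values."""
    return {name: (word >> low) & ((1 << width) - 1)
            for name, (low, width) in RSRC1_FIELDS.items()}
```

For example, a kernel that uses 24 VGPRs yields ``granulated_count(24, 4) == 5`` with a granule of 4.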
4086 .. table:: compute_pgm_rsrc2 for GFX6-GFX10
4087 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table
4089 ======= ======= =============================== ===========================================================================
4090 Bits Size Field Name Description
4091 ======= ======= =============================== ===========================================================================
4092 0 1 bit ENABLE_PRIVATE_SEGMENT * Enable the setup of the
4094 * If the *Target Properties*
4096 :ref:`amdgpu-processor-table`
4099 scratch* then enable the
4101 wavefront scratch offset
4102 system register (see
4103 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4104 * If the *Target Properties*
4106 :ref:`amdgpu-processor-table`
4107 specifies *Architected
4108 flat scratch* then enable
4110 FLAT_SCRATCH register
4112 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4114 Used by CP to set up
4115 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
4116 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
4118 requested. This number must
4119 match the number of user
4120 data registers enabled.
4122 Used by CP to set up
4123 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
4124 6 1 bit ENABLE_TRAP_HANDLER Must be 0.
4127 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
4128 which is set by the CP if
4129 the runtime has installed a
4131 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
4132 system SGPR register for
4133 the work-group id in the X
4135 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4137 Used by CP to set up
4138 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
4139 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
4140 system SGPR register for
4141 the work-group id in the Y
4143 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4145 Used by CP to set up
4146 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
4147 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
4148 system SGPR register for
4149 the work-group id in the Z
4151 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4153 Used by CP to set up
4154 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
4155 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
4156 system SGPR register for
4157 work-group information (see
4158 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4160 Used by CP to set up
4161 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
4162 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
4163 VGPR system registers used
4164 for the work-item ID.
4165 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
4168 Used by CP to set up
4169 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
4170 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
4172 Wavefront starts execution
4174 exceptions enabled which
4175 are generated when L1 has
4176 witnessed a thread access
4180 CP is responsible for
4181 filling in the address
4183 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4184 according to what the
4186 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
4188 Wavefront starts execution
4189 with memory violation
4190 exceptions
4191 enabled which are generated
4192 when a memory violation has
4193 occurred for this wavefront from
4195 (write-to-read-only-memory,
4196 mis-aligned atomic, LDS
4197 address out of range,
4198 illegal address, etc.).
4202 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4203 according to what the
4205 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
4207 CP uses the rounded value
4208 from the dispatch packet,
4209 not this value, as the
4210 dispatch may contain
4211 dynamically allocated group
4212 segment memory. CP writes
4214 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
4216 Amount of group segment
4217 (LDS) to allocate for each
4218 work-group. Granularity is
4222 roundup(lds-size / (64 * 4))
4224 roundup(lds-size / (128 * 4))
4226 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
4227 _INVALID_OPERATION with specified exceptions
4230 Used by CP to set up
4231 ``COMPUTE_PGM_RSRC2.EXCP_EN``
4232 (set from bits 0..6).
4236 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
4237 _SOURCE input operands is a
4239 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
4240 _DIVISION_BY_ZERO Zero
4241 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Overflow
4243 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
4245 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
4247 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
4248 _ZERO (rcp_iflag_f32 instruction
4250 31 1 bit Reserved, must be 0.
4251 32 **Total size 4 bytes.**
4252 ======= ===================================================================================================================
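A minimal sketch of how the fields of ``compute_pgm_rsrc2`` combine into one word is given below. The class and function names are illustrative only. The fields that the table says must be 0 are left at 0, and the limit of 16 on the user SGPR count is taken from the kernel descriptor table.

```python
from enum import IntFlag


class ExceptionEnable(IntFlag):
    """Bits 24..30 of compute_pgm_rsrc2, renumbered from 0 (names illustrative)."""
    INVALID_OPERATION = 1 << 0
    FP_DENORMAL_SOURCE = 1 << 1
    DIVISION_BY_ZERO = 1 << 2
    OVERFLOW = 1 << 3
    UNDERFLOW = 1 << 4
    INEXACT = 1 << 5
    INT_DIVIDE_BY_ZERO = 1 << 6


def encode_rsrc2(private_segment, user_sgpr_count, workgroup_ids="x",
                 workitem_id=0, exceptions=ExceptionEnable(0)):
    """Return the 32-bit word; workgroup_ids is any subset of "xyz"."""
    if not 0 <= user_sgpr_count <= 16:
        raise ValueError("at most 16 user SGPRs can be requested")
    if workitem_id not in (0, 1, 2):
        raise ValueError("work-item ID enumeration value 3 is undefined")
    word = int(bool(private_segment))            # bit 0
    word |= user_sgpr_count << 1                 # bits 5:1
    word |= int("x" in workgroup_ids) << 7       # bit 7
    word |= int("y" in workgroup_ids) << 8       # bit 8
    word |= int("z" in workgroup_ids) << 9       # bit 9
    word |= workitem_id << 11                    # bits 12:11
    word |= int(exceptions) << 24                # bits 30:24
    return word
```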
4256 .. table:: compute_pgm_rsrc3 for GFX90A
4257 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table
4259 ======= ======= =============================== ===========================================================================
4260 Bits Size Field Name Description
4261 ======= ======= =============================== ===========================================================================
4262 5:0 6 bits ACCUM_OFFSET Offset of a first AccVGPR in the unified register file. Granularity 4.
4263 Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
4264 63 - accum-offset = 256.
4265 15:6 10 bits Reserved, must be 0.
4267 16 1 bit TG_SPLIT - If 0 the waves of a work-group are
4268 launched in the same CU.
4269 - If 1 the waves of a work-group can be
4270 launched in different CUs. The waves
4271 cannot use S_BARRIER or LDS.
4272 31:17 15 bits Reserved, must be 0.
4274 32 **Total size 4 bytes.**
4275 ======= ===================================================================================================================
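The ``ACCUM_OFFSET`` encoding described above (0 means 4, 1 means 8, up to 63 meaning 256) amounts to dividing by the granularity and subtracting one. A short sketch, with hypothetical function names:

```python
def encode_accum_offset(accum_offset):
    """Map an AccVGPR offset (4, 8, ..., 256) to the 6-bit field value."""
    if accum_offset % 4 != 0 or not 4 <= accum_offset <= 256:
        raise ValueError("accum_offset must be a multiple of 4 in [4, 256]")
    return accum_offset // 4 - 1


def encode_rsrc3_gfx90a(accum_offset, tg_split=False):
    """Combine ACCUM_OFFSET (bits 5:0) and TG_SPLIT (bit 16)."""
    return encode_accum_offset(accum_offset) | (int(tg_split) << 16)
```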
4279 .. table:: compute_pgm_rsrc3 for GFX10
4280 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table
4282 ======= ======= =============================== ===========================================================================
4283 Bits Size Field Name Description
4284 ======= ======= =============================== ===========================================================================
4285 3:0 4 bits SHARED_VGPR_COUNT Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120.
4286 compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64.
4287 31:4 28 bits Reserved, must be 0.
4289 32 **Total size 4 bytes.**
4290 ======= ===================================================================================================================
4294 .. table:: Floating Point Rounding Mode Enumeration Values
4295 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
4297 ====================================== ===== ==============================
4298 Enumeration Name Value Description
4299 ====================================== ===== ==============================
4300 FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
4301 FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
4302 FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
4303 FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
4304 ====================================== ===== ==============================
4308 .. table:: Floating Point Denorm Mode Enumeration Values
4309 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
4311 ====================================== ===== ==============================
4312 Enumeration Name Value Description
4313 ====================================== ===== ==============================
4314 FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
4316 FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
4317 FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
4318 FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
4319 ====================================== ===== ==============================
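The rounding and denorm enumerations above fill four adjacent 2-bit fields of ``compute_pgm_rsrc1`` (bits 13:12, 15:14, 17:16 and 19:18). The sketch below shows one way to combine them. The Python class and function names are assumptions made for illustration; the numeric values come from the two tables.

```python
from enum import IntEnum


class RoundMode(IntEnum):
    NEAR_EVEN = 0
    PLUS_INFINITY = 1
    MINUS_INFINITY = 2
    ZERO = 3


class DenormMode(IntEnum):
    FLUSH_SRC_DST = 0
    FLUSH_DST = 1
    FLUSH_SRC = 2
    FLUSH_NONE = 3


def float_mode_bits(round_32, round_16_64, denorm_32, denorm_16_64):
    """Place the four modes at their compute_pgm_rsrc1 bit positions."""
    return ((round_32 << 12) | (round_16_64 << 14)
            | (denorm_32 << 16) | (denorm_16_64 << 18))
```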
4323 .. table:: System VGPR Work-Item ID Enumeration Values
4324 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
4326 ======================================== ===== ============================
4327 Enumeration Name Value Description
4328 ======================================== ===== ============================
4329 SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
4331 SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
4333 SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
4335 SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
4336 ======================================== ===== ============================
4338 .. _amdgpu-amdhsa-initial-kernel-execution-state:
4340 Initial Kernel Execution State
4341 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4343 This section defines the register state that will be set up by the packet
4344 processor prior to the start of execution of every wavefront. This is limited by
4345 the constraints of the hardware controllers of CP/ADC/SPI.
4347 The order of the SGPR registers is defined, but the compiler can specify which
4348 ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
4349 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4350 for enabled registers are dense starting at SGPR0: the first enabled register is
4351 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
4354 The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
4355 all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
4356 using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
4357 actually initialized. These are then immediately followed by the System SGPRs
4358 that are set up by ADC/SPI and can have different values for each wavefront of
4361 SGPR register initial state is defined in
4362 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
4364 .. table:: SGPR Register Set Up Order
4365 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
4367 ========== ========================== ====== ==============================
4368 SGPR Order Name Number Description
4369 (kernel descriptor enable of
4371 ========== ========================== ====== ==============================
4372 First Private Segment Buffer 4 See
4373 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4375 then Dispatch Ptr 2 64-bit address of AQL dispatch
4376 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
4378 then Queue Ptr 2 64-bit address of amd_queue_t
4379 (enable_sgpr_queue_ptr) object for AQL queue on which
4380 the dispatch packet was
4382 then Kernarg Segment Ptr 2 64-bit address of Kernarg
4383 (enable_sgpr_kernarg segment. This is directly
4384 _segment_ptr) copied from the
4385 kernarg_address in the kernel
4388 Having CP load it once avoids
4389 loading it at the beginning of
4391 then Dispatch Id 2 64-bit Dispatch ID of the
4392 (enable_sgpr_dispatch_id) dispatch packet being
4394 then Flat Scratch Init 2 See
4395 (enable_sgpr_flat_scratch :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4397 then Private Segment Size 1 The 32-bit byte size of a
4398 (enable_sgpr_private single work-item's memory
4399 _segment_size) allocation. This is the
4400 value from the kernel
4401 dispatch packet Private
4402 Segment Byte Size rounded up
4403 by CP to a multiple of
4406 Having CP load it once avoids
4407 loading it at the beginning of
4410 This is not used for
4411 GFX7-GFX8 since it is the same
4412 value as the second SGPR of
4413 Flat Scratch Init. However, it
4414 may be needed for GFX9-GFX10 which
4415 changes the meaning of the
4416 Flat Scratch Init value.
4417 then Work-Group Id X 1 32-bit work-group id in X
4418 (enable_sgpr_workgroup_id dimension of grid for
4420 then Work-Group Id Y 1 32-bit work-group id in Y
4421 (enable_sgpr_workgroup_id dimension of grid for
4423 then Work-Group Id Z 1 32-bit work-group id in Z
4424 (enable_sgpr_workgroup_id dimension of grid for
4426 then Work-Group Info 1 {first_wavefront, 14'b0000,
4427 (enable_sgpr_workgroup ordered_append_term[10:0],
4428 _info) threadgroup_size_in_wavefronts[5:0]}
4429 then Scratch Wavefront Offset 1 See
4430 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4431 _segment_wavefront_offset) and
4432 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4433 ========== ========================== ====== ==============================
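The dense numbering rule can be made concrete with a small sketch that walks the set-up order of the table above and assigns register numbers to the enabled entries only. The names in the lists are shortened forms of the enable bits, chosen for this example, and the function is hypothetical.

```python
# (name, number of SGPRs) in the set-up order given by the table.
USER_SGPRS = [
    ("private_segment_buffer", 4),
    ("dispatch_ptr", 2),
    ("queue_ptr", 2),
    ("kernarg_segment_ptr", 2),
    ("dispatch_id", 2),
    ("flat_scratch_init", 2),
    ("private_segment_size", 1),
]
SYSTEM_SGPRS = [
    ("workgroup_id_x", 1),
    ("workgroup_id_y", 1),
    ("workgroup_id_z", 1),
    ("workgroup_info", 1),
    ("private_segment_wavefront_offset", 1),
]


def sgpr_layout(enabled):
    """Return ({name: (first SGPR, count)}, user SGPR count) for enabled names."""
    known = {name for name, _ in USER_SGPRS + SYSTEM_SGPRS}
    unknown = set(enabled) - known
    if unknown:
        raise ValueError(f"unknown SGPR names: {sorted(unknown)}")
    layout = {}
    next_sgpr = 0
    for name, count in USER_SGPRS:
        if name in enabled:
            layout[name] = (next_sgpr, count)
            next_sgpr += count
    user_sgpr_count = next_sgpr  # system SGPRs follow immediately
    for name, count in SYSTEM_SGPRS:
        if name in enabled:
            layout[name] = (next_sgpr, count)
            next_sgpr += count
    return layout, user_sgpr_count
```

Enabling only the private segment buffer, the kernarg segment pointer and the work-group id X places them in SGPR0-3, SGPR4-5 and SGPR6, with a user SGPR count of 6.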
4435 The order of the VGPR registers is defined, but the compiler can specify which
4436 ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
4437 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4438 for enabled registers are dense starting at VGPR0: the first enabled register is
4439 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
4442 There are different methods used for the VGPR initial state:
4444 * Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
4445 specifies otherwise, a separate VGPR register is used per work-item ID. The
4446 VGPR register initial state for this method is defined in
4447 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
4448 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4449 specifies *Packed work-item IDs*, the initial value of VGPR0 register is used
4450 for all work-item IDs. The register layout for this method is defined in
4451 :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.
4453 .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
4454 :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table
4456 ========== ========================== ====== ==============================
4457 VGPR Order Name Number Description
4458 (kernel descriptor enable of
4460 ========== ========================== ====== ==============================
4461 First Work-Item Id X 1 32-bit work-item id in X
4462 (Always initialized) dimension of work-group for
4464 then Work-Item Id Y 1 32-bit work-item id in Y
4465 (enable_vgpr_workitem_id dimension of work-group for
4466 > 0) wavefront lane.
4467 then Work-Item Id Z 1 32-bit work-item id in Z
4468 (enable_vgpr_workitem_id dimension of work-group for
4469 > 1) wavefront lane.
4470 ========== ========================== ====== ==============================
4474 .. table:: Register Layout for Packed Work-Item ID Method
4475 :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table
4477 ======= ======= ================ =========================================
4478 Bits Size Field Name Description
4479 ======= ======= ================ =========================================
4480 0:9 10 bits Work-Item Id X Work-item id in X
4481 dimension of work-group for
4486 10:19 10 bits Work-Item Id Y Work-item id in Y
4487 dimension of work-group for
4490 Initialized if enable_vgpr_workitem_id >
4491 0, otherwise set to 0.
4492 20:29 10 bits Work-Item Id Z Work-item id in Z
4493 dimension of work-group for
4496 Initialized if enable_vgpr_workitem_id >
4497 1, otherwise set to 0.
4498 30:31 2 bits Reserved, set to 0.
4499 ======= ======= ================ =========================================
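For the packed method, a kernel would recover the three IDs from VGPR0 with shifts and a 10-bit mask. The sketch below models this on a plain integer. The function names are hypothetical, and the bit positions are those of the table above.

```python
ID_MASK = 0x3FF  # each work-item ID occupies 10 bits


def pack_workitem_ids(x, y=0, z=0):
    """Build the VGPR0 value: X in bits 0:9, Y in bits 10:19, Z in bits 20:29."""
    for value in (x, y, z):
        if not 0 <= value <= ID_MASK:
            raise ValueError("work-item ID does not fit in 10 bits")
    return x | (y << 10) | (z << 20)


def unpack_workitem_ids(vgpr0):
    """Return the (x, y, z) work-item IDs held in a packed VGPR0 value."""
    return (vgpr0 & ID_MASK, (vgpr0 >> 10) & ID_MASK, (vgpr0 >> 20) & ID_MASK)
```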
4501 The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
4503 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data registers.
4505 2. Work-group Id registers X, Y, Z are set by ADC which supports any
4506 combination including none.
4507 3. Scratch Wavefront Offset is set by SPI on a per wavefront basis which is why
4508 its value cannot be included with the flat scratch init value which is per
4509 queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
4510 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y) or (X, Y, Z).
4512 5. Flat Scratch register pair initialization is described in
4513 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4515 The global segment can be accessed either using buffer instructions (GFX6 which
4516 has V# 64-bit address support), flat instructions (GFX7-GFX10), or global
4517 instructions (GFX9-GFX10).
4519 If buffer operations are used, then the compiler can generate a V# with the
4520 following properties:
4524 * ATC: 1 if IOMMU present (such as APU)
4526 * MTYPE set to support memory coherence that matches the runtime (such as CC for
4527 APU and NC for dGPU).
4529 .. _amdgpu-amdhsa-kernel-prolog:
4534 The compiler performs initialization in the kernel prologue depending on the
4535 target and information about things like stack usage in the kernel and called
4536 functions. Some of this initialization requires the compiler to request certain
4537 User and System SGPRs be present in the
4538 :ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
4539 :ref:`amdgpu-amdhsa-kernel-descriptor`.
4541 .. _amdgpu-amdhsa-kernel-prolog-cfi:
4546 1. The CFI return address is undefined.
4548 2. The CFI CFA is defined using an expression which evaluates to a location
4549 description that comprises one memory location description for the
4550 ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
4552 .. _amdgpu-amdhsa-kernel-prolog-m0:
4558 The M0 register must be initialized with a value at least the total LDS size
4559 if the kernel may access LDS via DS or flat operations. The total LDS size is
4560 available in the dispatch packet. For M0, it is also possible to use the maximum
4561 possible value of LDS for the given target (0x7FFF for GFX6 and 0xFFFF for
4564 The M0 register is not used for range checking LDS accesses and so does not
4565 need to be initialized in the prolog.
4567 .. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
4572 If the kernel has function calls it must set up the ABI stack pointer described
4573 in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
4574 SGPR32 to the unswizzled scratch offset of the address past the last local
4577 .. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
4582 If the kernel needs a frame pointer for the reasons defined in
4583 ``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
4584 kernel prolog. If a frame pointer is not required then all uses of the frame
4585 pointer are replaced with immediate ``0`` offsets.
4587 .. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
4592 There are different methods used for initializing flat scratch:
4594 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4595 specifies *Does not support generic address space*:
4597 Flat scratch is not supported and there is no flat scratch register pair.
4599 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4600 specifies *Offset flat scratch*:
4602 If the kernel or any function it calls may use flat operations to access
4603 scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4604 (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
4605 Scratch Wavefront Offset SGPR registers (see
4606 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4608 1. The low word of Flat Scratch Init is the 32-bit byte offset from
4609 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
4610 being managed by SPI for the queue executing the kernel dispatch. This is
4611 the same value used in the Scratch Segment Buffer V# base address.
4613 CP obtains this from the runtime. (The Scratch Segment Buffer base address
4614 is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)
4616 The prolog must add the value of Scratch Wavefront Offset to get the
4617 wavefront's byte scratch backing memory offset from
4618 ``SH_HIDDEN_PRIVATE_BASE_VIMID``.
4620 The Scratch Wavefront Offset must also be used as an offset with Private
4621 segment address when using the Scratch Segment Buffer.
4623 Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
4624 shifted by 8 before moving into FLAT_SCRATCH_HI.
4626 FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
4627 SGPRn is the highest numbered SGPR allocated to the wavefront).
4628 FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
4629 added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
4630 FLAT SCRATCH BASE in flat memory instructions that access the scratch
4632 2. The second word of Flat Scratch Init is the 32-bit byte size of a single
4633 work-item's scratch memory usage.
4635 CP obtains this from the runtime, and it is always a multiple of DWORD. CP
4636 checks that the value in the kernel dispatch packet Private Segment Byte
4637 Size is not larger and requests the runtime to increase the queue's scratch
4640 CP directly loads from the kernel dispatch packet Private Segment Byte Size
4641 field and rounds up to a multiple of DWORD. Having CP load it once avoids
4642 loading it at the beginning of every wavefront.
4644 The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
4645 GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
4646 in flat memory instructions.
4648 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4649 specifies *Absolute flat scratch*:
4651 If the kernel or any function it calls may use flat operations to access
4652 scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4653 (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
4654 uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
4655 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4657 The Flat Scratch Init is the 64-bit address of the base of scratch backing
4658 memory being managed by SPI for the queue executing the kernel dispatch.
4660 CP obtains this from the runtime.
4662 The kernel prolog must add the value of the wave's Scratch Wavefront Offset
4663 and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
4664 which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
4665 memory instructions.
4667 The Scratch Wavefront Offset must also be used as an offset with Private
4668 segment address when using the Scratch Segment Buffer (see
4669 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
4671 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4672 specifies *Architected flat scratch*:
4674 If ENABLE_PRIVATE_SEGMENT is enabled in
4675 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table` then the FLAT_SCRATCH
4676 register pair will be initialized to the 64-bit address of the base of scratch
4677 backing memory being managed by SPI for the queue executing the kernel
4678 dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
4679 flat scratch base in flat memory instructions.
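The prolog arithmetic for the *Offset flat scratch* and *Absolute flat scratch* methods can be modelled on plain integers as below. This is only a sketch of the calculations described above, not the instruction sequence the compiler emits. The function names are hypothetical, and splitting the 64-bit result into a low and a high word for the register pair is an assumption.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def offset_flat_scratch(init_lo, init_hi, wave_offset):
    """Offset method: return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI).

    init_lo is the byte offset of the queue's scratch backing memory,
    init_hi is the per-work-item scratch size in bytes.
    """
    byte_offset = (init_lo + wave_offset) & MASK32
    flat_scratch_hi = byte_offset >> 8   # units of 256 bytes
    flat_scratch_lo = init_hi            # used as the flat scratch size
    return flat_scratch_lo, flat_scratch_hi


def absolute_flat_scratch(init_address, wave_offset):
    """Absolute method: return the 64-bit base as (low word, high word)."""
    base = (init_address + wave_offset) & MASK64
    return base & MASK32, base >> 32
```

With the architected method no such code is needed, since the register pair is already initialized when the wavefront starts.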
4681 .. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
4683 Private Segment Buffer
4684 ++++++++++++++++++++++
4686 If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
4687 *Architected flat scratch* then a Private Segment Buffer is not supported.
4688 Instead the flat SCRATCH instructions are used.
4690 Otherwise, the Private Segment Buffer SGPR register is used to initialize 4 SGPRs
4691 that are used as a V# to access scratch. CP uses the value provided by the
4692 runtime. It is used, together with Scratch Wavefront Offset as an offset, to
4693 access the private memory space using a segment address. See
4694 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
4696 The scratch V# is a four-aligned SGPR and always selected for the kernel as follows:
4699 - If it is known during instruction selection that there is stack usage,
4700 SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
4701 optimizations are disabled (``-O0``), if stack objects already exist (for
4702 locals, etc.), or if there are any function calls.
4704 - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
4705 are reserved for the tentative scratch V#. These will be used if it is
4706 determined that spilling is needed.
4708 - If no use is made of the tentative scratch V#, then it is unreserved,
4709 and the register count is determined ignoring it.
4710 - If use is made of the tentative scratch V#, then its register numbers
4711 are shifted to the first four-aligned SGPR index after the highest one
4712 allocated by the register allocator, and all uses are updated. The
4713 register count includes them in the shifted location.
4714 - In either case, if the processor has the SGPR allocation bug, the
4715 tentative allocation is not shifted or unreserved in order to ensure
4716 the register count is higher to work around the bug.
4720 This approach of using a tentative scratch V# and shifting the register
4721 numbers if used avoids having to perform register allocation a second
4722 time if the tentative V# is eliminated. This is more efficient and
4723 avoids the problem that the second register allocation may perform
4724 spilling which will fail as there is no longer a scratch V#.
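The selection rules above can be sketched as a small function. This is an illustrative model under stated assumptions, not the backend's implementation: the function name, its parameters, and the example index 96 for the tentative V# are all hypothetical.

.. code-block:: python

  def align_up(value, alignment):
      """Round value up to the next multiple of alignment."""
      return (value + alignment - 1) // alignment * alignment

  def final_scratch_vsharp(stack_known, tentative_base, vsharp_used,
                           highest_allocated_sgpr, has_sgpr_alloc_bug=False):
      """Return the first SGPR index of the scratch V#, or None if unreserved.

      stack_known            -- stack usage is known during instruction selection
      tentative_base         -- four-aligned index of the tentative V# (hypothetical)
      vsharp_used            -- the tentative V# was actually used (e.g. for spills)
      highest_allocated_sgpr -- highest SGPR index chosen by the register allocator
      """
      if stack_known:
          return 0                      # SGPR0-3 is reserved up front
      if has_sgpr_alloc_bug:
          return tentative_base         # neither shifted nor unreserved
      if not vsharp_used:
          return None                   # unreserved; register count ignores it
      # Shift to the first four-aligned index after the highest allocated SGPR.
      return align_up(highest_allocated_sgpr + 1, 4)

For example, if the register allocator's highest SGPR is 9 and the tentative V# was used, the sketch places the V# at SGPR12-15. Because only register numbers are rewritten, no second register allocation pass is needed.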
4726 When the kernel prolog code is being emitted it is known whether the scratch V#
4727 described above is actually used. If it is, the prolog code must set it up by
4728 copying the Private Segment Buffer to the scratch V# registers and then adding
4729 the Private Segment Wavefront Offset to the queue base address in the V#. The
4730 result is a V# with a base address pointing to the beginning of the wavefront
4731 scratch backing memory.
4733 The Private Segment Buffer is always requested, but the Private Segment
4734 Wavefront Offset is only requested if it is used (see
4735 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4737 .. _amdgpu-amdhsa-memory-model:
4742 This section describes the mapping of the LLVM memory model onto AMDGPU machine
4743 code (see :ref:`memmodel`).
4745 The AMDGPU backend supports the memory synchronization scopes specified in
4746 :ref:`amdgpu-memory-scopes`.
4748 The code sequences used to implement the memory model specify the order of
4749 instructions that a single thread must execute. The ``s_waitcnt`` and cache
4750 management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
4751 to other memory instructions executed by the same thread. This allows them to be
4752 moved earlier or later which can allow them to be combined with other instances
4753 of the same instruction, or hoisted/sunk out of loops to improve performance.
4754 Only the instructions related to the memory model are given; additional
4755 ``s_waitcnt`` instructions are required to ensure registers are defined before
4756 being used. These may be able to be combined with the memory model ``s_waitcnt``
4757 instructions as described above.
4759 The AMDGPU backend supports the following memory models:
4761 HSA Memory Model [HSA]_
4762 The HSA memory model uses a single happens-before relation for all address
4763 spaces (see :ref:`amdgpu-address-spaces`).
4764 OpenCL Memory Model [OpenCL]_
4765 The OpenCL memory model has separate happens-before relations for the
4766 global and local address spaces. Only a fence specifying both global and
4767 local address space, and seq_cst instructions join the relationships. Since
4768 the LLVM ``fence`` instruction does not allow an address space to be
4769 specified the OpenCL fence has to conservatively assume both local and
4770 global address space was specified. However, optimizations can often be
4771 done to eliminate the additional ``s_waitcnt`` instructions when there are
4772 no intervening memory instructions which access the corresponding address
4773 space. The code sequences in the table indicate what can be omitted for the
4774 OpenCL memory model. The target triple environment is used to determine if the
4775 source language is OpenCL (see :ref:`amdgpu-opencl`).
4777 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS operations.
4780 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
4781 termed vector memory operations.
4783 Private address space uses ``buffer_load/store`` using the scratch V#
4784 (GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread
4785 is accessing the memory, atomic memory orderings are not meaningful, and all
4786 accesses are treated as non-atomic.
4788 Constant address space uses ``buffer/global_load`` instructions (or equivalent
4789 scalar memory instructions). Since the constant address space contents do not
4790 change during the execution of a kernel dispatch it is not legal to perform
4791 stores, and atomic memory orderings are not meaningful, and all accesses are
4792 treated as non-atomic.
4794 A memory synchronization scope wider than work-group is not meaningful for the
4795 group (LDS) address space and is treated as work-group.
4797 The memory model does not support the region address space which is treated as non-atomic.
4800 Acquire memory ordering is not meaningful on store atomic instructions and is
4801 treated as non-atomic.
4803 Release memory ordering is not meaningful on load atomic instructions and is
4804 treated as non-atomic.
4806 Acquire-release memory ordering is not meaningful on load or store atomic
4807 instructions and is treated as acquire and release respectively.
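The normalization rules in the preceding paragraphs can be summarized in a short sketch. The function names and the string spellings of instructions, orderings, scopes, and address spaces are assumptions made for illustration (they follow the spellings used in the tables of this section); the sketch only restates the rules given above.

.. code-block:: python

  def effective_ordering(instr, ordering):
      """Map a requested memory ordering to the one actually implemented."""
      if instr == "load atomic":
          if ordering == "release":
              return "non-atomic"   # release is not meaningful on a load
          if ordering == "acq_rel":
              return "acquire"
      elif instr == "store atomic":
          if ordering == "acquire":
              return "non-atomic"   # acquire is not meaningful on a store
          if ordering == "acq_rel":
              return "release"
      return ordering

  def effective_scope(address_space, scope):
      """Scopes wider than work-group collapse to work-group for LDS."""
      if address_space == "local" and scope in ("agent", "system"):
          return "workgroup"
      return scope

An ``atomicrmw`` passes through unchanged in this sketch because it both loads and stores, so every ordering is meaningful for it.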
4809 The memory order also adds the single thread optimization constraints defined in
4811 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.
4813 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
4814 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table
4816 ============ ==============================================================
4817 LLVM Memory Optimization Constraints
4819 ============ ==============================================================
4822 acquire - If a load atomic/atomicrmw then no following load/load
4823 atomic/store/store atomic/atomicrmw/fence instruction can be
4824 moved before the acquire.
4825 - If a fence then same as load atomic, plus no preceding
4826 associated fence-paired-atomic can be moved after the fence.
4827 release - If a store atomic/atomicrmw then no preceding load/load
4828 atomic/store/store atomic/atomicrmw/fence instruction can be
4829 moved after the release.
4830 - If a fence then same as store atomic, plus no following
4831 associated fence-paired-atomic can be moved before the
4833 acq_rel Same constraints as both acquire and release.
4834 seq_cst - If a load atomic then same constraints as acquire, plus no
4835 preceding sequentially consistent load atomic/store
4836 atomic/atomicrmw/fence instruction can be moved after the
4838 - If a store atomic then the same constraints as release, plus
4839 no following sequentially consistent load atomic/store
4840 atomic/atomicrmw/fence instruction can be moved before the
4842 - If an atomicrmw/fence then same constraints as acq_rel.
4843 ============ ==============================================================
4845 The code sequences used to implement the memory model are defined in the following sections:
4848 * :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
4849 * :ref:`amdgpu-amdhsa-memory-model-gfx90a`
4850 * :ref:`amdgpu-amdhsa-memory-model-gfx10`
4852 .. _amdgpu-amdhsa-memory-model-gfx6-gfx9:
4854 Memory Model GFX6-GFX9
4855 ++++++++++++++++++++++
4859 * Each agent has multiple shader arrays (SA).
4860 * Each SA has multiple compute units (CU).
4861 * Each CU has multiple SIMDs that execute wavefronts.
4862 * The wavefronts for a single work-group are executed in the same CU but may be
4863 executed by different SIMDs.
4864 * Each CU has a single LDS memory shared by the wavefronts of the work-groups executing on it.
4866 * All LDS operations of a CU are performed as wavefront wide operations in a
4867 global order and involve no caching. Completion is reported to a wavefront in execution order.
4869 * The LDS memory has multiple request queues shared by the SIMDs of a
4870 CU. Therefore, the LDS operations performed by different wavefronts of a
4871 work-group can be reordered relative to each other, which can result in
4872 reordering the visibility of vector memory operations with respect to LDS
4873 operations of other wavefronts in the same work-group. A ``s_waitcnt
4874 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
4875 vector memory operations between wavefronts of a work-group, but not between
4876 operations performed by the same wavefront.
4877 * The vector memory operations are performed as wavefront wide operations and
4878 completion is reported to a wavefront in execution order. The exception is
4879 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
4880 vector memory order if they access LDS memory, and out of LDS operation order
4881 if they access global memory.
4882 * The vector memory operations access a single vector L1 cache shared by all
4883 SIMDs of a CU. Therefore, no special action is required for coherence between the
4884 lanes of a single wavefront, or for coherence between wavefronts in the same
4885 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
4886 wavefronts executing in different work-groups as they may be executing on different CUs.
4888 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
4889 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
4890 scalar operations are used in a restricted way so do not impact the memory
4891 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
4892 * The vector and scalar memory operations use an L2 cache shared by all CUs on the same agent.
4894 * The L2 cache has independent channels to service disjoint ranges of virtual addresses.
4896 * Each CU has a separate request queue per channel. Therefore, the vector and
4897 scalar memory operations performed by wavefronts executing in different
4898 work-groups (which may be executing on different CUs) of an agent can be
4899 reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
4900 ensure synchronization between vector memory operations of different CUs. It
4901 ensures a previous vector memory operation has completed before executing a
4902 subsequent vector memory or LDS operation and so can be used to meet the
4903 requirements of acquire and release.
4904 * The L2 cache can be kept coherent with other agents on some targets, or ranges
4905 of virtual addresses can be set up to bypass it to ensure system coherence.
4907 Scalar memory operations are only used to access memory that is proven to not
4908 change during the execution of the kernel dispatch. This includes constant
4909 address space and global address space for program scope ``const`` variables.
4910 Therefore, the kernel machine code does not have to maintain the scalar cache to
4911 ensure it is coherent with the vector caches. The scalar and vector caches are
4912 invalidated between kernel dispatches by CP since constant address space data
4913 may change between kernel dispatch executions. See
4914 :ref:`amdgpu-amdhsa-memory-spaces`.
4916 The one exception is if scalar writes are used to spill SGPR registers. In this
4917 case the AMDGPU backend ensures the memory location used to spill is never
4918 accessed by vector memory operations at the same time. If scalar writes are used
4919 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
4920 return since the locations may be used for vector memory instructions by a
4921 future wavefront that uses the same scratch area, or a function call that
4922 creates a frame at the same address, respectively. There is no need for a
4923 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
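The placement rule for ``s_dcache_wb`` can be sketched as a pass over a list of instruction strings. This is a hypothetical illustration, not the backend's pass: the function name is invented, and treating every ``s_setpc_b64`` as a function return is a simplifying assumption of the sketch.

.. code-block:: python

  # Mnemonics treated as "end of kernel" or "function return" in this sketch.
  # Assumption: a function return appears as s_setpc_b64.
  EXIT_MNEMONICS = ("s_endpgm", "s_setpc_b64")

  def insert_scalar_writeback(instructions, uses_scalar_writes):
      """Return a copy of instructions with s_dcache_wb before each exit.

      Nothing is inserted when scalar writes are not used for SGPR spills.
      Every element of instructions is assumed to be a non-empty string.
      """
      if not uses_scalar_writes:
          return list(instructions)
      result = []
      for inst in instructions:
          mnemonic = inst.split()[0]
          if mnemonic in EXIT_MNEMONICS:
              result.append("s_dcache_wb")
          result.append(inst)
      return result

  kernel = ["s_nop 0", "s_endpgm"]
  patched = insert_scalar_writeback(kernel, uses_scalar_writes=True)

No ``s_dcache_inv`` is inserted anywhere, matching the observation above that scalar spill locations are always written before they are read by the same thread.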
4925 For kernarg backing memory:
4927 * CP invalidates the L1 cache at the start of each kernel dispatch.
4928 * On dGPU the kernarg backing memory is allocated in host memory accessed as
4929 MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
4930 causes it to be treated as non-volatile and so is not invalidated by ``*_vol``.
4932 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
4933 and so the L2 cache will be coherent with the CPU and other agents.
4935 Scratch backing memory (which is used for the private address space) is accessed
4936 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
4937 only accessed by a single thread, and is always write-before-read, there is
4938 never a need to invalidate these entries from the L1 cache. Hence all cache
4939 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
4941 The code sequences used to implement the memory model for GFX6-GFX9 are defined
4942 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
4944 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
4945 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
4947 ============ ============ ============== ========== ================================
4948 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
4949 Ordering Sync Scope Address GFX6-GFX9
4951 ============ ============ ============== ========== ================================
4953 ------------------------------------------------------------------------------------
4954 load *none* *none* - global - !volatile & !nontemporal
4956 - private 1. buffer/global/flat_load
4958 - !volatile & nontemporal
4960 1. buffer/global/flat_load
4965 1. buffer/global/flat_load
4967 2. s_waitcnt vmcnt(0)
4969 - Must happen before
4970 any following volatile
4981 load *none* *none* - local 1. ds_load
4982 store *none* *none* - global - !volatile & !nontemporal
4984 - private 1. buffer/global/flat_store
4986 - !volatile & nontemporal
4988 1. buffer/global/flat_store
4993 1. buffer/global/flat_store
4994 2. s_waitcnt vmcnt(0)
4996 - Must happen before
4997 any following volatile
5008 store *none* *none* - local 1. ds_store
5009 **Unordered Atomic**
5010 ------------------------------------------------------------------------------------
5011 load atomic unordered *any* *any* *Same as non-atomic*.
5012 store atomic unordered *any* *any* *Same as non-atomic*.
5013 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
5014 **Monotonic Atomic**
5015 ------------------------------------------------------------------------------------
5016 load atomic monotonic - singlethread - global 1. buffer/global/ds/flat_load
5018 - workgroup - generic
5019 load atomic monotonic - agent - global 1. buffer/global/flat_load
5020 - system - generic glc=1
5021 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
5022 - wavefront - generic
5026 store atomic monotonic - singlethread - local 1. ds_store
5029 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
5030 - wavefront - generic
5034 atomicrmw monotonic - singlethread - local 1. ds_atomic
5038 ------------------------------------------------------------------------------------
5039 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
5042 load atomic acquire - workgroup - global 1. buffer/global_load
5043 load atomic acquire - workgroup - local 1. ds/flat_load
5044 - generic 2. s_waitcnt lgkmcnt(0)
5047 - Must happen before
5056 older than a local load
5060 load atomic acquire - agent - global 1. buffer/global_load
5062 2. s_waitcnt vmcnt(0)
5064 - Must happen before
5072 3. buffer_wbinvl1_vol
5074 - Must happen before
5084 load atomic acquire - agent - generic 1. flat_load glc=1
5085 - system 2. s_waitcnt vmcnt(0) &
5090 - Must happen before
5093 - Ensures the flat_load
5098 3. buffer_wbinvl1_vol
5100 - Must happen before
5110 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
5113 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
5114 atomicrmw acquire - workgroup - local 1. ds/flat_atomic
5115 - generic 2. s_waitcnt lgkmcnt(0)
5118 - Must happen before
5131 atomicrmw acquire - agent - global 1. buffer/global_atomic
5132 - system 2. s_waitcnt vmcnt(0)
5134 - Must happen before
5143 3. buffer_wbinvl1_vol
5145 - Must happen before
5155 atomicrmw acquire - agent - generic 1. flat_atomic
5156 - system 2. s_waitcnt vmcnt(0) &
5161 - Must happen before
5170 3. buffer_wbinvl1_vol
5172 - Must happen before
5182 fence acquire - singlethread *none* *none*
5184 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
5189 - However, since LLVM
5214 fence-paired-atomic).
5215 - Must happen before
5226 fence-paired-atomic.
5228 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
5235 - However, since LLVM
5243 - Could be split into
5252 - s_waitcnt vmcnt(0)
5263 fence-paired-atomic).
5264 - s_waitcnt lgkmcnt(0)
5275 fence-paired-atomic).
5276 - Must happen before
5290 fence-paired-atomic.
5292 2. buffer_wbinvl1_vol
5294 - Must happen before any
5295 following global/generic
5305 ------------------------------------------------------------------------------------
5306 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
5309 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
5318 - Must happen before
5329 2. buffer/global/flat_store
5330 store atomic release - workgroup - local 1. ds_store
5331 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
5332 - system - generic vmcnt(0)
5338 - Could be split into
5347 - s_waitcnt vmcnt(0)
5354 - s_waitcnt lgkmcnt(0)
5361 - Must happen before
5372 2. buffer/global/flat_store
5373 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
5376 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
5385 - Must happen before
5396 2. buffer/global/flat_atomic
5397 atomicrmw release - workgroup - local 1. ds_atomic
5398 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
5399 - system - generic vmcnt(0)
5403 - Could be split into
5412 - s_waitcnt vmcnt(0)
5419 - s_waitcnt lgkmcnt(0)
5426 - Must happen before
5437 2. buffer/global/flat_atomic
5438 fence release - singlethread *none* *none*
5440 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
5445 - However, since LLVM
5466 - Must happen before
5475 fence-paired-atomic).
5482 fence-paired-atomic.
5484 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
5495 - However, since LLVM
5510 - Could be split into
5519 - s_waitcnt vmcnt(0)
5526 - s_waitcnt lgkmcnt(0)
5533 - Must happen before
5542 fence-paired-atomic).
5549 fence-paired-atomic.
5551 **Acquire-Release Atomic**
5552 ------------------------------------------------------------------------------------
5553 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
5556 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
5565 - Must happen before
5576 2. buffer/global_atomic
5578 atomicrmw acq_rel - workgroup - local 1. ds_atomic
5579 2. s_waitcnt lgkmcnt(0)
5582 - Must happen before
5591 older than the local load
5595 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
5604 - Must happen before
5616 3. s_waitcnt lgkmcnt(0)
5619 - Must happen before
5628 older than a local load
5632 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
5637 - Could be split into
5646 - s_waitcnt vmcnt(0)
5653 - s_waitcnt lgkmcnt(0)
5660 - Must happen before
5671 2. buffer/global_atomic
5672 3. s_waitcnt vmcnt(0)
5674 - Must happen before
5683 4. buffer_wbinvl1_vol
5685 - Must happen before
5695 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
5700 - Could be split into
5709 - s_waitcnt vmcnt(0)
5716 - s_waitcnt lgkmcnt(0)
5723 - Must happen before
5735 3. s_waitcnt vmcnt(0) &
5740 - Must happen before
5749 4. buffer_wbinvl1_vol
5751 - Must happen before
5761 fence acq_rel - singlethread *none* *none*
5763 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
5783 - Must happen before
5806 acquire-fence-paired-atomic)
5827 release-fence-paired-atomic).
5832 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
5839 - However, since LLVM
5847 - Could be split into
5856 - s_waitcnt vmcnt(0)
5863 - s_waitcnt lgkmcnt(0)
5870 - Must happen before
5875 global/local/generic
5884 acquire-fence-paired-atomic)
5896 global/local/generic
5905 release-fence-paired-atomic).
5910 2. buffer_wbinvl1_vol
5912 - Must happen before
5926 **Sequential Consistent Atomic**
5927 ------------------------------------------------------------------------------------
5928 load atomic seq_cst - singlethread - global *Same as corresponding
5929 - wavefront - local load atomic acquire,
5930 - generic except must generate
5931 all instructions even
5933 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
5949 lgkmcnt(0) and so do
5981 order. The s_waitcnt
5982 could be placed after
5986 make the s_waitcnt be
5993 instructions same as
5996 except must generate
5997 all instructions even
5999 load atomic seq_cst - workgroup - local *Same as corresponding
6000 load atomic acquire,
6001 except must generate
6002 all instructions even
6005 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
6006 - system - generic vmcnt(0)
6008 - Could be split into
6017 - s_waitcnt lgkmcnt(0)
6030 lgkmcnt(0) and so do
6033 - s_waitcnt vmcnt(0)
6078 order. The s_waitcnt
6079 could be placed after
6083 make the s_waitcnt be
6090 instructions same as
6093 except must generate
6094 all instructions even
6096 store atomic seq_cst - singlethread - global *Same as corresponding
6097 - wavefront - local store atomic release,
6098 - workgroup - generic except must generate
6099 - agent all instructions even
6100 - system for OpenCL.*
6101 atomicrmw seq_cst - singlethread - global *Same as corresponding
6102 - wavefront - local atomicrmw acq_rel,
6103 - workgroup - generic except must generate
6104 - agent all instructions even
6105 - system for OpenCL.*
6106 fence seq_cst - singlethread *none* *Same as corresponding
6107 - wavefront fence acq_rel,
6108 - workgroup except must generate
6109 - agent all instructions even
6110 - system for OpenCL.*
6111 ============ ============ ============== ========== ================================
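As a reading aid for the table above, the ``store atomic release`` rows can be transcribed into a lookup structure. This is only a sketch: the dictionary and function names are hypothetical, the strings are the table's shorthand (for example ``buffer/global/flat_store`` means one of those instructions) rather than real mnemonics, and only the scope and address space combinations that open a row in the table are included.

.. code-block:: python

  # (sync scope, address space) -> ordered code sequence, in table shorthand.
  RELEASE_STORE_GFX6_GFX9 = {
      ("singlethread", "global"): ["buffer/global/ds/flat_store"],
      ("workgroup", "global"): ["s_waitcnt lgkmcnt(0)",
                                "buffer/global/flat_store"],
      ("workgroup", "local"): ["ds_store"],
  }

  # Agent and system scope share one row for global and generic addresses.
  for _scope in ("agent", "system"):
      for _aspace in ("global", "generic"):
          RELEASE_STORE_GFX6_GFX9[(_scope, _aspace)] = [
              "s_waitcnt lgkmcnt(0) & vmcnt(0)",
              "buffer/global/flat_store",
          ]

  def release_store_sequence(scope, address_space):
      """Look up the sequence; raises KeyError for rows not transcribed here."""
      return RELEASE_STORE_GFX6_GFX9[(scope, address_space)]

The pattern visible in the lookup is that the wait precedes the store, and that it widens from ``lgkmcnt(0)`` at work-group scope to ``lgkmcnt(0) & vmcnt(0)`` at agent and system scope.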
6113 .. _amdgpu-amdhsa-memory-model-gfx90a:
6120 * Each agent has multiple shader arrays (SA).
6121 * Each SA has multiple compute units (CU).
6122 * Each CU has multiple SIMDs that execute wavefronts.
6123 * The wavefronts for a single work-group are executed in the same CU but may be
6124 executed by different SIMDs. The exception is tgsplit execution mode, in
6125 which the wavefronts may be executed by different SIMDs in different CUs.
6126 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
6127 executing on it. The exception is tgsplit execution mode, in which no LDS
6128 is allocated because wavefronts of the same work-group can be in different CUs.
6129 * All LDS operations of a CU are performed as wavefront wide operations in a
6130 global order and involve no caching. Completion is reported to a wavefront in execution order.
6132 * The LDS memory has multiple request queues shared by the SIMDs of a
6133 CU. Therefore, the LDS operations performed by different wavefronts of a
6134 work-group can be reordered relative to each other, which can result in
6135 reordering the visibility of vector memory operations with respect to LDS
6136 operations of other wavefronts in the same work-group. A ``s_waitcnt
6137 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
6138 vector memory operations between wavefronts of a work-group, but not between
6139 operations performed by the same wavefront.
6140 * The vector memory operations are performed as wavefront wide operations and
6141 completion is reported to a wavefront in execution order. The exception is
6142 that ``flat_load/store/atomic`` instructions can report out of vector memory
6143 order if they access LDS memory, and out of LDS operation order if they access global memory.
6145 * The vector memory operations access a single vector L1 cache shared by all
6146 SIMDs of a CU. Therefore:
6148 * No special action is required for coherence between the lanes of a single wavefront.
6151 * No special action is required for coherence between wavefronts in the same
6152 work-group since they execute on the same CU. The exception is when in
6153 tgsplit execution mode as wavefronts of the same work-group can be in
6154 different CUs and so a ``buffer_wbinvl1_vol`` is required as described in
6157 * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
6158 executing in different work-groups as they may be executing on different CUs.
6161 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
6162 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
6163 scalar operations are used in a restricted way so do not impact the memory
6164 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
6165 * The vector and scalar memory operations use an L2 cache shared by all CUs on the same agent.
6168 * The L2 cache has independent channels to service disjoint ranges of virtual addresses.
6170 * Each CU has a separate request queue per channel. Therefore, the vector and
6171 scalar memory operations performed by wavefronts executing in different
6172 work-groups (which may be executing on different CUs), or the same
6173 work-group if executing in tgsplit mode, of an agent can be reordered
6174 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
6175 synchronization between vector memory operations of different CUs. It
6176 ensures a previous vector memory operation has completed before executing a
6177 subsequent vector memory or LDS operation and so can be used to meet the
6178 requirements of acquire and release.
6179 * The L2 cache of one agent can be kept coherent with other agents by:
6180 using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
6181 C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
6182 the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
6184 * Any local memory cache lines will be automatically invalidated by writes
6185 from CUs associated with other L2 caches, or writes from the CPU, due to
6186 the cache probe caused by coherent requests. Coherent requests are caused
6187 by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
6188 XGMI, and by PCIe requests that are configured to be coherent requests.
6189 * XGMI accesses from the CPU to local memory may be cached on the CPU.
6190 Subsequent access from the GPU will automatically invalidate or writeback
6191 the CPU cache due to the L2 probe filter and the PTE C-bit being set.
6192 * Since all work-groups on the same agent share the same L2, no L2
6193 invalidation or writeback is required for coherence.
6194 * To ensure coherence of local and remote memory writes of work-groups in
6195 different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
6196 cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
6197 (used for remote coarse grain memory). Note that MTYPE CC (used for local
6198 fine grain memory) causes write through to DRAM, and MTYPE UC (used for
6199 remote fine grain memory) bypasses the L2, so both will never result in
6200 dirty L2 cache lines.
6201 * To ensure coherence of local and remote memory reads of work-groups in
6202 different agents a ``buffer_invl2`` is required. It will invalidate L2
6203 cache lines with MTYPE NC (used for remote coarse grain memory). Note that
6204 MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
6205 coarse grain memory) cause local reads to be invalidated by remote writes with
6206 the PTE C-bit so these cache lines are not invalidated. Note that
6207 MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
6208 never result in L2 cache lines that need to be invalidated.
6210 * PCIe access from the GPU to the CPU memory is kept coherent by using the
6211 MTYPE UC (uncached) which bypasses the L2.
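The effect of ``buffer_wbl2`` and ``buffer_invl2`` on each MTYPE, as described in the list above, can be restated as a small lookup. The table and function names below are hypothetical, and the sketch encodes only what the text states; it is not a description of the hardware beyond that.

.. code-block:: python

  # mtype: (written back by buffer_wbl2, invalidated by buffer_invl2)
  L2_MAINTENANCE_GFX90A = {
      "RW": (True, False),   # local coarse grain: may be dirty, kept on invl2
      "NC": (True, True),    # remote coarse grain: may be dirty, invalidated
      "CC": (False, False),  # local fine grain: write through, never dirty
      "UC": (False, False),  # remote fine grain: bypasses the L2 entirely
  }

  def wbl2_affects(mtype):
      """True if buffer_wbl2 may write back dirty L2 lines of this MTYPE."""
      return L2_MAINTENANCE_GFX90A[mtype][0]

  def invl2_affects(mtype):
      """True if buffer_invl2 invalidates L2 lines of this MTYPE."""
      return L2_MAINTENANCE_GFX90A[mtype][1]

Only MTYPE NC is affected by both instructions, which reflects that remote coarse grain memory is the one case with no hardware mechanism keeping the L2 coherent.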
6213 Scalar memory operations are only used to access memory that is proven to not
6214 change during the execution of the kernel dispatch. This includes constant
6215 address space and global address space for program scope ``const`` variables.
6216 Therefore, the kernel machine code does not have to maintain the scalar cache to
6217 ensure it is coherent with the vector caches. The scalar and vector caches are
6218 invalidated between kernel dispatches by CP since constant address space data
6219 may change between kernel dispatch executions. See
6220 :ref:`amdgpu-amdhsa-memory-spaces`.
6222 The one exception is if scalar writes are used to spill SGPR registers. In this
6223 case the AMDGPU backend ensures the memory location used to spill is never
6224 accessed by vector memory operations at the same time. If scalar writes are used
6225 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
6226 return since the locations may be used for vector memory instructions by a
6227 future wavefront that uses the same scratch area, or a function call that
6228 creates a frame at the same address, respectively. There is no need for a
6229 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
6231 For kernarg backing memory:
6233 * CP invalidates the L1 cache at the start of each kernel dispatch.
6234 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
6235 memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
6236 cache. This also causes it to be treated as non-volatile and so is not
6237 invalidated by ``*_vol``.
6238 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
6239 so the L2 cache will be coherent with the CPU and other agents.
6241 Scratch backing memory (which is used for the private address space) is accessed
6242 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
6243 only accessed by a single thread, and is always write-before-read, there is
6244 never a need to invalidate these entries from the L1 cache. Hence all cache
6245 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
6247 The code sequences used to implement the memory model for GFX90A are defined
6248 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
6250 .. table:: AMDHSA Memory Model Code Sequences GFX90A
6251 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table
6253 ============ ============ ============== ========== ================================
6254 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
6255 Ordering Sync Scope Address GFX90A
6257 ============ ============ ============== ========== ================================
6259 ------------------------------------------------------------------------------------
6260 load *none* *none* - global - !volatile & !nontemporal
6262 - private 1. buffer/global/flat_load
6264 - !volatile & nontemporal
6266 1. buffer/global/flat_load
6271 1. buffer/global/flat_load
6273 2. s_waitcnt vmcnt(0)
6275 - Must happen before
6276 any following volatile
6287 load *none* *none* - local 1. ds_load
6288 store *none* *none* - global - !volatile & !nontemporal
6290 - private 1. buffer/global/flat_store
6292 - !volatile & nontemporal
6294 1. buffer/global/flat_store
6299 1. buffer/global/flat_store
6300 2. s_waitcnt vmcnt(0)
6302 - Must happen before
6303 any following volatile
6314 store *none* *none* - local 1. ds_store
6315 **Unordered Atomic**
6316 ------------------------------------------------------------------------------------
6317 load atomic unordered *any* *any* *Same as non-atomic*.
6318 store atomic unordered *any* *any* *Same as non-atomic*.
6319 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
6320 **Monotonic Atomic**
6321 ------------------------------------------------------------------------------------
6322 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
6323 - wavefront - generic
6324 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
6327 - If not TgSplit execution
6330 load atomic monotonic - singlethread - local *If TgSplit execution mode,
6331 - wavefront local address space cannot
6332 - workgroup be used.*
6335 load atomic monotonic - agent - global 1. buffer/global/flat_load
6337 load atomic monotonic - system - global 1. buffer/global/flat_load
6339 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
6340 - wavefront - generic
6343 store atomic monotonic - system - global 1. buffer/global/flat_store
6345 store atomic monotonic - singlethread - local *If TgSplit execution mode,
6346 - wavefront local address space cannot
6347 - workgroup be used.*
6350 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
6351 - wavefront - generic
6354 atomicrmw monotonic - system - global 1. buffer/global/flat_atomic
6356 atomicrmw monotonic - singlethread - local *If TgSplit execution mode,
6357 - wavefront local address space cannot
6358 - workgroup be used.*
6362 ------------------------------------------------------------------------------------
6363 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
6366 load atomic acquire - workgroup - global 1. buffer/global_load glc=1
6368 - If not TgSplit execution
6371 2. s_waitcnt vmcnt(0)
6373 - If not TgSplit execution
6375 - Must happen before the
6376 following buffer_wbinvl1_vol.
6378 3. buffer_wbinvl1_vol
6380 - If not TgSplit execution
6382 - Must happen before
6393 load atomic acquire - workgroup - local *If TgSplit execution mode,
6394 local address space cannot
6398 2. s_waitcnt lgkmcnt(0)
6401 - Must happen before
6410 older than the local load
6414 load atomic acquire - workgroup - generic 1. flat_load glc=1
6416 - If not TgSplit execution
6419 2. s_waitcnt lgkm/vmcnt(0)
6421 - Use lgkmcnt(0) if not
6422 TgSplit execution mode
6423 and vmcnt(0) if TgSplit
6425 - If OpenCL, omit lgkmcnt(0).
6426 - Must happen before
6428 buffer_wbinvl1_vol and any
6429 following global/generic
6436 older than a local load
6440 3. buffer_wbinvl1_vol
6442 - If not TgSplit execution
6449 load atomic acquire - agent - global 1. buffer/global_load
6451 2. s_waitcnt vmcnt(0)
6453 - Must happen before
6461 3. buffer_wbinvl1_vol
6463 - Must happen before
6473 load atomic acquire - system - global 1. buffer/global/flat_load
6475 2. s_waitcnt vmcnt(0)
6477 - Must happen before
6478 following buffer_invl2 and
6488 - Must happen before
6496 stale L1 global data,
6497 nor see stale L2 MTYPE
6499 MTYPE RW and CC memory will
6500 never be stale in L2 due to
6503 load atomic acquire - agent - generic 1. flat_load glc=1
6504 2. s_waitcnt vmcnt(0) &
6507 - If TgSplit execution mode,
6511 - Must happen before
6514 - Ensures the flat_load
6519 3. buffer_wbinvl1_vol
6521 - Must happen before
6531 load atomic acquire - system - generic 1. flat_load glc=1
6532 2. s_waitcnt vmcnt(0) &
6535 - If TgSplit execution mode,
6539 - Must happen before
6543 - Ensures the flat_load
6551 - Must happen before
6559 stale L1 global data,
6560 nor see stale L2 MTYPE
6562 MTYPE RW and CC memory will
6563 never be stale in L2 due to
6566 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic
6567 - wavefront - generic
6568 atomicrmw acquire - singlethread - local *If TgSplit execution mode,
6569 - wavefront local address space cannot
6573 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
6574 2. s_waitcnt vmcnt(0)
6576 - If not TgSplit execution
6578 - Must happen before the
6579 following buffer_wbinvl1_vol.
6580 - Ensures the atomicrmw
6585 3. buffer_wbinvl1_vol
6587 - If not TgSplit execution
6589 - Must happen before
6599 atomicrmw acquire - workgroup - local *If TgSplit execution mode,
6600 local address space cannot
6604 2. s_waitcnt lgkmcnt(0)
6607 - Must happen before
6616 older than the local
6620 atomicrmw acquire - workgroup - generic 1. flat_atomic
6621 2. s_waitcnt lgkm/vmcnt(0)
6623 - Use lgkmcnt(0) if not
6624 TgSplit execution mode
6625 and vmcnt(0) if TgSplit
6627 - If OpenCL, omit lgkmcnt(0).
6628 - Must happen before
6630 buffer_wbinvl1_vol and
6643 3. buffer_wbinvl1_vol
6645 - If not TgSplit execution
6652 atomicrmw acquire - agent - global 1. buffer/global_atomic
6653 2. s_waitcnt vmcnt(0)
6655 - Must happen before
6664 3. buffer_wbinvl1_vol
6666 - Must happen before
6676 atomicrmw acquire - system - global 1. buffer/global_atomic
6677 2. s_waitcnt vmcnt(0)
6679 - Must happen before
6680 following buffer_invl2 and
6691 - Must happen before
6699 stale L1 global data,
6700 nor see stale L2 MTYPE
6702 MTYPE RW and CC memory will
6703 never be stale in L2 due to
6706 atomicrmw acquire - agent - generic 1. flat_atomic
6707 2. s_waitcnt vmcnt(0) &
6710 - If TgSplit execution mode,
6714 - Must happen before
6723 3. buffer_wbinvl1_vol
6725 - Must happen before
6735 atomicrmw acquire - system - generic 1. flat_atomic
6736 2. s_waitcnt vmcnt(0) &
6739 - If TgSplit execution mode,
6743 - Must happen before
6756 - Must happen before
6764 stale L1 global data,
6765 nor see stale L2 MTYPE
6767 MTYPE RW and CC memory will
6768 never be stale in L2 due to
6771 fence acquire - singlethread *none* *none*
6773 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
6775 - Use lgkmcnt(0) if not
6776 TgSplit execution mode
6777 and vmcnt(0) if TgSplit
6787 - However, since LLVM
6802 - s_waitcnt vmcnt(0)
6814 fence-paired-atomic).
6815 - s_waitcnt lgkmcnt(0)
6826 fence-paired-atomic).
6827 - Must happen before
6829 buffer_wbinvl1_vol and
6840 fence-paired-atomic.
6842 2. buffer_wbinvl1_vol
6844 - If not TgSplit execution
6851 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
6854 - If TgSplit execution mode,
6860 - However, since LLVM
6868 - Could be split into
6877 - s_waitcnt vmcnt(0)
6888 fence-paired-atomic).
6889 - s_waitcnt lgkmcnt(0)
6900 fence-paired-atomic).
6901 - Must happen before
6915 fence-paired-atomic.
6917 2. buffer_wbinvl1_vol
6919 - Must happen before any
6920 following global/generic
6929 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) &
6932 - If TgSplit execution mode,
6938 - However, since LLVM
6946 - Could be split into
6955 - s_waitcnt vmcnt(0)
6966 fence-paired-atomic).
6967 - s_waitcnt lgkmcnt(0)
6978 fence-paired-atomic).
6979 - Must happen before
6980 the following buffer_invl2 and
6993 fence-paired-atomic.
6998 - Must happen before any
6999 following global/generic
7006 stale L1 global data,
7007 nor see stale L2 MTYPE
7009 MTYPE RW and CC memory will
7010 never be stale in L2 due to
7013 ------------------------------------------------------------------------------------
7014 store atomic release - singlethread - global 1. buffer/global/flat_store
7015 - wavefront - generic
7016 store atomic release - singlethread - local *If TgSplit execution mode,
7017 - wavefront local address space cannot
7021 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
7023 - Use lgkmcnt(0) if not
7024 TgSplit execution mode
7025 and vmcnt(0) if TgSplit
7027 - If OpenCL, omit lgkmcnt(0).
7028 - s_waitcnt vmcnt(0)
7031 global/generic load/store/
7032 load atomic/store atomic/
7034 - s_waitcnt lgkmcnt(0)
7041 - Must happen before
7052 2. buffer/global/flat_store
7053 store atomic release - workgroup - local *If TgSplit execution mode,
7054 local address space cannot
7058 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
7061 - If TgSplit execution mode,
7067 - Could be split into
7076 - s_waitcnt vmcnt(0)
7083 - s_waitcnt lgkmcnt(0)
7090 - Must happen before
7101 2. buffer/global/flat_store
7102 store atomic release - system - global 1. buffer_wbl2
7104 - Must happen before
7105 following s_waitcnt.
7106 - Performs L2 writeback to
7110 visible at system scope.
7112 2. s_waitcnt lgkmcnt(0) &
7115 - If TgSplit execution mode,
7121 - Could be split into
7130 - s_waitcnt vmcnt(0)
7131 must happen after any
7137 - s_waitcnt lgkmcnt(0)
7138 must happen after any
7144 - Must happen before
7149 to memory and the L2
7156 3. buffer/global/flat_store
7157 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic
7158 - wavefront - generic
7159 atomicrmw release - singlethread - local *If TgSplit execution mode,
7160 - wavefront local address space cannot
7164 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
7166 - Use lgkmcnt(0) if not
7167 TgSplit execution mode
7168 and vmcnt(0) if TgSplit
7172 - s_waitcnt vmcnt(0)
7175 global/generic load/store/
7176 load atomic/store atomic/
7178 - s_waitcnt lgkmcnt(0)
7185 - Must happen before
7196 2. buffer/global/flat_atomic
7197 atomicrmw release - workgroup - local *If TgSplit execution mode,
7198 local address space cannot
7202 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
7205 - If TgSplit execution mode,
7209 - Could be split into
7218 - s_waitcnt vmcnt(0)
7225 - s_waitcnt lgkmcnt(0)
7232 - Must happen before
7243 2. buffer/global/flat_atomic
7244 atomicrmw release - system - global 1. buffer_wbl2
7246 - Must happen before
7247 following s_waitcnt.
7248 - Performs L2 writeback to
7252 visible at system scope.
7254 2. s_waitcnt lgkmcnt(0) &
7257 - If TgSplit execution mode,
7261 - Could be split into
7270 - s_waitcnt vmcnt(0)
7277 - s_waitcnt lgkmcnt(0)
7284 - Must happen before
7289 to memory and the L2
7296 3. buffer/global/flat_atomic
7297 fence release - singlethread *none* *none*
7299 fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
7301 - Use lgkmcnt(0) if not
7302 TgSplit execution mode
7303 and vmcnt(0) if TgSplit
7313 - However, since LLVM
7328 - s_waitcnt vmcnt(0)
7333 load atomic/store atomic/
7335 - s_waitcnt lgkmcnt(0)
7342 - Must happen before
7351 fence-paired-atomic).
7358 fence-paired-atomic.
7360 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
7363 - If TgSplit execution mode,
7373 - However, since LLVM
7388 - Could be split into
7397 - s_waitcnt vmcnt(0)
7404 - s_waitcnt lgkmcnt(0)
7411 - Must happen before
7420 fence-paired-atomic).
7427 fence-paired-atomic.
7429 fence release - system *none* 1. buffer_wbl2
7434 - Must happen before
7435 following s_waitcnt.
7436 - Performs L2 writeback to
7440 visible at system scope.
7442 2. s_waitcnt lgkmcnt(0) &
7445 - If TgSplit execution mode,
7455 - However, since LLVM
7470 - Could be split into
7479 - s_waitcnt vmcnt(0)
7486 - s_waitcnt lgkmcnt(0)
7493 - Must happen before
7502 fence-paired-atomic).
7509 fence-paired-atomic.
7511 **Acquire-Release Atomic**
7512 ------------------------------------------------------------------------------------
7513 atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic
7514 - wavefront - generic
7515 atomicrmw acq_rel - singlethread - local *If TgSplit execution mode,
7516 - wavefront local address space cannot
7520 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
7522 - Use lgkmcnt(0) if not
7523 TgSplit execution mode
7524 and vmcnt(0) if TgSplit
7534 - s_waitcnt vmcnt(0)
7537 global/generic load/store/
7538 load atomic/store atomic/
7540 - s_waitcnt lgkmcnt(0)
7547 - Must happen before
7558 2. buffer/global_atomic
7559 3. s_waitcnt vmcnt(0)
7561 - If not TgSplit execution
7563 - Must happen before
7573 4. buffer_wbinvl1_vol
7575 - If not TgSplit execution
7582 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode,
7583 local address space cannot
7587 2. s_waitcnt lgkmcnt(0)
7590 - Must happen before
7599 older than the local load
7603 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkm/vmcnt(0)
7605 - Use lgkmcnt(0) if not
7606 TgSplit execution mode
7607 and vmcnt(0) if TgSplit
7611 - s_waitcnt vmcnt(0)
7614 global/generic load/store/
7615 load atomic/store atomic/
7617 - s_waitcnt lgkmcnt(0)
7624 - Must happen before
7636 3. s_waitcnt lgkmcnt(0) &
7639 - If not TgSplit execution
7640 mode, omit vmcnt(0).
7643 - Must happen before
7645 buffer_wbinvl1_vol and
7654 older than a local load
7658 3. buffer_wbinvl1_vol
7660 - If not TgSplit execution
7667 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
7670 - If TgSplit execution mode,
7674 - Could be split into
7683 - s_waitcnt vmcnt(0)
7690 - s_waitcnt lgkmcnt(0)
7697 - Must happen before
7708 2. buffer/global_atomic
7709 3. s_waitcnt vmcnt(0)
7711 - Must happen before
7720 4. buffer_wbinvl1_vol
7722 - Must happen before
7732 atomicrmw acq_rel - system - global 1. buffer_wbl2
7734 - Must happen before
7735 following s_waitcnt.
7736 - Performs L2 writeback to
7740 visible at system scope.
7742 2. s_waitcnt lgkmcnt(0) &
7745 - If TgSplit execution mode,
7749 - Could be split into
7758 - s_waitcnt vmcnt(0)
7765 - s_waitcnt lgkmcnt(0)
7772 - Must happen before
7777 to global and L2 writeback
7778 have completed before
7783 3. buffer/global_atomic
7784 4. s_waitcnt vmcnt(0)
7786 - Must happen before
7787 following buffer_invl2 and
7798 - Must happen before
7806 stale L1 global data,
7807 nor see stale L2 MTYPE
7809 MTYPE RW and CC memory will
7810 never be stale in L2 due to
7813 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
7816 - If TgSplit execution mode,
7820 - Could be split into
7829 - s_waitcnt vmcnt(0)
7836 - s_waitcnt lgkmcnt(0)
7843 - Must happen before
7855 3. s_waitcnt vmcnt(0) &
7858 - If TgSplit execution mode,
7862 - Must happen before
7871 4. buffer_wbinvl1_vol
7873 - Must happen before
7883 atomicrmw acq_rel - system - generic 1. buffer_wbl2
7885 - Must happen before
7886 following s_waitcnt.
7887 - Performs L2 writeback to
7891 visible at system scope.
7893 2. s_waitcnt lgkmcnt(0) &
7896 - If TgSplit execution mode,
7900 - Could be split into
7909 - s_waitcnt vmcnt(0)
7916 - s_waitcnt lgkmcnt(0)
7923 - Must happen before
7928 to global and L2 writeback
7929 have completed before
7935 4. s_waitcnt vmcnt(0) &
7938 - If TgSplit execution mode,
7942 - Must happen before
7943 following buffer_invl2 and
7954 - Must happen before
7962 stale L1 global data,
7963 nor see stale L2 MTYPE
7965 MTYPE RW and CC memory will
7966 never be stale in L2 due to
7969 fence acq_rel - singlethread *none* *none*
7971 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
7973 - Use lgkmcnt(0) if not
7974 TgSplit execution mode
7975 and vmcnt(0) if TgSplit
7994 - s_waitcnt vmcnt(0)
7999 load atomic/store atomic/
8001 - s_waitcnt lgkmcnt(0)
8008 - Must happen before
8031 acquire-fence-paired-atomic)
8052 release-fence-paired-atomic).
8056 - Must happen before
8060 acquire-fence-paired
8061 atomic has completed
8070 acquire-fence-paired-atomic.
8072 2. buffer_wbinvl1_vol
8074 - If not TgSplit execution
8081 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
8084 - If TgSplit execution mode,
8090 - However, since LLVM
8098 - Could be split into
8107 - s_waitcnt vmcnt(0)
8114 - s_waitcnt lgkmcnt(0)
8121 - Must happen before
8126 global/local/generic
8135 acquire-fence-paired-atomic)
8147 global/local/generic
8156 release-fence-paired-atomic).
8161 2. buffer_wbinvl1_vol
8163 - Must happen before
8177 fence acq_rel - system *none* 1. buffer_wbl2
8182 - Must happen before
8183 following s_waitcnt.
8184 - Performs L2 writeback to
8188 visible at system scope.
8190 2. s_waitcnt lgkmcnt(0) &
8193 - If TgSplit execution mode,
8199 - However, since LLVM
8207 - Could be split into
8216 - s_waitcnt vmcnt(0)
8223 - s_waitcnt lgkmcnt(0)
8230 - Must happen before
8231 the following buffer_invl2 and
8235 global/local/generic
8244 acquire-fence-paired-atomic)
8256 global/local/generic
8265 release-fence-paired-atomic).
8273 - Must happen before
8282 stale L1 global data,
8283 nor see stale L2 MTYPE
8285 MTYPE RW and CC memory will
8286 never be stale in L2 due to
8289 **Sequential Consistent Atomic**
8290 ------------------------------------------------------------------------------------
8291 load atomic seq_cst - singlethread - global *Same as corresponding
8292 - wavefront - local load atomic acquire,
8293 - generic except must generate
8294 all instructions even
8296 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
8298 - Use lgkmcnt(0) if not
8299 TgSplit execution mode
8300 and vmcnt(0) if TgSplit
8302 - s_waitcnt lgkmcnt(0) must
8315 lgkmcnt(0) and so do
8318 - s_waitcnt vmcnt(0)
8337 consistent global/local
8363 order. The s_waitcnt
8364 could be placed after
8368 make the s_waitcnt be
8375 instructions same as
8378 except must generate
8379 all instructions even
8381 load atomic seq_cst - workgroup - local *If TgSplit execution mode,
8382 local address space cannot
8385 *Same as corresponding
8386 load atomic acquire,
8387 except must generate
8388 all instructions even
8391 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
8392 - system - generic vmcnt(0)
8394 - If TgSplit execution mode,
8396 - Could be split into
8405 - s_waitcnt lgkmcnt(0)
8418 lgkmcnt(0) and so do
8421 - s_waitcnt vmcnt(0)
8466 order. The s_waitcnt
8467 could be placed after
8471 make the s_waitcnt be
8478 instructions same as
8481 except must generate
8482 all instructions even
8484 store atomic seq_cst - singlethread - global *Same as corresponding
8485 - wavefront - local store atomic release,
8486 - workgroup - generic except must generate
8487 - agent all instructions even
8488 - system for OpenCL.*
8489 atomicrmw seq_cst - singlethread - global *Same as corresponding
8490 - wavefront - local atomicrmw acq_rel,
8491 - workgroup - generic except must generate
8492 - agent all instructions even
8493 - system for OpenCL.*
8494 fence seq_cst - singlethread *none* *Same as corresponding
8495 - wavefront fence acq_rel,
8496 - workgroup except must generate
8497 - agent all instructions even
8498 - system for OpenCL.*
8499 ============ ============ ============== ========== ================================
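
The *Same as corresponding ...* entries for ``seq_cst`` in the table above
amount to a lookup from one row to another. A minimal sketch of that lookup
follows. ``SEQ_CST_REUSES`` and ``reused_row`` are hypothetical names and are
not part of LLVM. The sketch does not model the additional leading
``s_waitcnt`` that the table lists for ``load atomic seq_cst`` at workgroup
scope and wider, nor the requirement to generate all instructions even for
OpenCL.

.. code-block:: python

   # Rows of the table whose code sequence a seq_cst operation reuses.
   # Keys and values are (LLVM instruction, LLVM memory ordering) pairs.
   SEQ_CST_REUSES = {
       ("load atomic", "seq_cst"): ("load atomic", "acquire"),
       ("store atomic", "seq_cst"): ("store atomic", "release"),
       ("atomicrmw", "seq_cst"): ("atomicrmw", "acq_rel"),
       ("fence", "seq_cst"): ("fence", "acq_rel"),
   }


   def reused_row(instr, ordering):
       """Return the (instruction, ordering) row whose sequence is reused.

       Rows that are not seq_cst are returned unchanged.
       """
       return SEQ_CST_REUSES.get((instr, ordering), (instr, ordering))

For example, ``reused_row("fence", "seq_cst")`` names the ``fence acq_rel`` row,
while ``reused_row("load atomic", "monotonic")`` returns its input unchanged.
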
.. _amdgpu-amdhsa-memory-model-gfx10:

* Each agent has multiple shader arrays (SA).
* Each SA has multiple work-group processors (WGP).
* Each WGP has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same
  WGP. In CU wavefront execution mode the wavefronts may be executed by
  different SIMDs in the same CU. In WGP wavefront execution mode the
  wavefronts may be executed by different SIMDs in different CUs in the same
  WGP.
* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a WGP are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  WGP. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A ``s_waitcnt
  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
  vector memory operations between wavefronts of a work-group, but not between
  operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations.
  Completion of load/store/sample operations is reported to a wavefront in
  execution order of other load/store/sample operations performed by that
  wavefront.
* The vector memory operations access a vector L0 cache. There is a single L0
  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
  special action is required for coherence between the lanes of a single
  wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
  wavefronts executing in the same work-group as they may be executing on SIMDs
  of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different WGPs.
* The scalar memory operations access a scalar L0 cache shared by all wavefronts
  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
  operations are used in a restricted way so do not impact the memory model. See
  :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
  the same SA. Therefore, no special action is required for coherence between
  the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
  required for coherence between wavefronts executing in different work-groups
  as they may be executing on different SAs that access different L1s.
* The L1 caches have independent quadrants to service disjoint ranges of virtual
  addresses.
* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
  vector and scalar memory operations performed by different wavefronts, whether
  executing in the same or different work-groups (which may be executing on
  different CUs accessing different L0s), can be reordered relative to each
  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
  synchronization between vector memory operations of different wavefronts. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire, release and sequential consistency.
* The L1 caches use an L2 cache shared by all SAs on the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
  quadrant has a separate request queue per L2 channel. Therefore, the vector
  and scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different SAs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
  required to ensure synchronization between vector memory operations of
  different SAs. It ensures a previous vector memory operation has completed
  before executing a subsequent vector memory operation and so can be used to
  meet the requirements of acquire, release and sequential consistency.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.
* On GFX10.3 a memory attached last level (MALL) cache exists for GPU memory.
  The MALL cache is fully coherent with GPU memory and has no impact on system
  coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.

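
The vector cache hierarchy described in the list above can be summarised by
asking which is the closest cache level that two wavefronts share. The following
is a minimal sketch under that description only. ``Placement`` and
``closest_shared_cache`` are hypothetical names, the integer indices are an
illustrative encoding, and the scalar L0 cache and the MALL cache are not
modelled.

.. code-block:: python

   from typing import NamedTuple


   class Placement(NamedTuple):
       """Hypothetical location of a wavefront within one agent."""
       sa: int   # shader array index within the agent
       wgp: int  # work-group processor index within the SA
       cu: int   # compute unit index within the WGP


   def closest_shared_cache(a, b):
       """Return the closest vector cache level shared by two wavefronts."""
       if a == b:
           # Same CU: every SIMD of a CU accesses the same vector L0 cache.
           return "L0"
       if a.sa == b.sa:
           # Different CUs on the same SA share that SA's L1 cache.
           return "L1"
       # Different SAs on the same agent only share the L2 cache.
       return "L2"

Two wavefronts of a work-group running on both CUs of a WGP therefore first meet
at L1, which is why their L0 caches need explicit management.
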
Scalar memory operations are only used to access memory that is proven to not
change during the execution of the kernel dispatch. This includes constant
address space and global address space for program scope ``const`` variables.
Therefore, the kernel machine code does not have to maintain the scalar cache to
ensure it is coherent with the vector caches. The scalar and vector caches are
invalidated between kernel dispatches by CP since constant address space data
may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

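
A minimal sketch of the placement rule for ``s_dcache_wb`` follows. It is not
the AMDGPU backend's implementation: ``insert_scalar_writeback`` is a
hypothetical helper that works on a plain list of mnemonics, and using
``s_setpc_b64`` to stand for a function return is an assumption made for the
example.

.. code-block:: python

   # Instructions that must be preceded by a scalar cache write-back.
   # Assumption: "s_setpc_b64" stands in for a function return here.
   WRITEBACK_BEFORE = frozenset({"s_endpgm", "s_setpc_b64"})


   def insert_scalar_writeback(instructions, uses_scalar_writes):
       """Return a copy of the mnemonic list with s_dcache_wb inserted.

       Nothing is inserted when the function performs no scalar writes.
       """
       if not uses_scalar_writes:
           return list(instructions)
       result = []
       for instr in instructions:
           if instr in WRITEBACK_BEFORE:
               result.append("s_dcache_wb")
           result.append(instr)
       return result

No ``s_dcache_inv`` appears in the sketch because, as stated above, scalar
writes are always write-before-read in the same thread.
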
For kernarg backing memory:

* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
  needing to invalidate the L2 cache.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
  so the L2 cache will be coherent with the CPU and other agents.

Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC (non-coherent). Since the private address space is only accessed
by a single thread, and is always write-before-read, there is never a need to
invalidate these entries from the L0 or L1 caches.

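
The MTYPE choices above can be summarised as a small lookup.
``backing_memory_mtype`` is a hypothetical helper written for illustration. It
only restates the cases listed in this section and the string names for the
memory kinds are an assumed encoding.

.. code-block:: python

   def backing_memory_mtype(kind, is_apu):
       """Return the MTYPE used for GFX10 backing memory of the given kind.

       kind is "kernarg" or "scratch"; is_apu selects APU versus dGPU.
       """
       if kind == "scratch":
           # Private address space: non-coherent on both APU and dGPU.
           return "NC"
       if kind == "kernarg":
           # Cache coherent on APU, uncached on dGPU.
           return "CC" if is_apu else "UC"
       raise ValueError(f"unknown backing memory kind: {kind!r}")
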
Wavefronts are executed in native mode with in-order reporting of loads and
sample instructions. In this mode vmcnt reports completion of load, atomic with
return and sample instructions in order, and vscnt reports completion of store
and atomic without return instructions in order. See the ``MEM_ORDERED`` field
in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.

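
The split between the two counters can be sketched as a small classifier. This
is an illustration of the rule stated above, not backend code:
``completion_counter`` and its operation names are hypothetical.

.. code-block:: python

   def completion_counter(op, returns_value=False):
       """Return the counter that tracks completion of a vector memory operation.

       op is "load", "sample", "store" or "atomic". returns_value is only
       consulted for atomics.
       """
       if op in ("load", "sample"):
           return "vmcnt"
       if op == "store":
           return "vscnt"
       if op == "atomic":
           # Atomics with a return value behave like loads, those without
           # a return value behave like stores.
           return "vmcnt" if returns_value else "vscnt"
       raise ValueError(f"unknown vector memory operation: {op!r}")

This is the same distinction the code sequence table draws when it says to use
``vmcnt(0)`` for an atomic with return and ``vscnt(0)`` for an atomic with
no return.
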
Wavefronts can be executed in WGP or CU wavefront execution mode:

* In WGP wavefront execution mode the wavefronts of a work-group are executed
  on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
  CU L0 caches is required for work-group synchronization. Also, accesses to L1
  at work-group scope need to be explicitly ordered as the accesses from
  different CUs are not ordered.
* In CU wavefront execution mode the wavefronts of a work-group are executed on
  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
  the work-group use the same L0 cache, which in turn ensures L1 accesses are
  ordered, so no explicit management of the caches is required for work-group
  synchronization.

See the ``WGP_MODE`` field in
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and
:ref:`amdgpu-target-features`.

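
Combining the two execution modes with the cache hierarchy described earlier
gives a rule of thumb for which vector cache invalidates an acquiring wavefront
needs. The following is a sketch derived from the prose of this section only.
``required_invalidates`` is a hypothetical helper, and it returns an unordered
set because the exact instruction sequences are given by the code sequence
table.

.. code-block:: python

   def required_invalidates(same_work_group, cu_mode):
       """Return the cache invalidates needed to observe another wavefront's writes.

       same_work_group: the other wavefront belongs to the same work-group.
       cu_mode: wavefronts run in CU (rather than WGP) wavefront execution mode.
       """
       if same_work_group:
           if cu_mode:
               # All wavefronts of the work-group share a single L0 cache.
               return frozenset()
           # WGP mode: the work-group may span both CUs, each with its own L0.
           return frozenset({"buffer_gl0_inv"})
       # Different work-groups may run on different WGPs and different SAs,
       # so both the L0 and the L1 cache can hold stale data.
       return frozenset({"buffer_gl0_inv", "buffer_gl1_inv"})
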
The code sequences used to implement the memory model for GFX10 are defined in
table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-table`.

8637 .. table:: AMDHSA Memory Model Code Sequences GFX10
8638 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-table
8640 ============ ============ ============== ========== ================================
8641 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
8642 Ordering Sync Scope Address GFX10
8644 ============ ============ ============== ========== ================================
8646 ------------------------------------------------------------------------------------
8647 load *none* *none* - global - !volatile & !nontemporal
8649 - private 1. buffer/global/flat_load
8651 - !volatile & nontemporal
8653 1. buffer/global/flat_load
8658 1. buffer/global/flat_load
8660 2. s_waitcnt vmcnt(0)
8662 - Must happen before
8663 any following volatile
8674 load *none* *none* - local 1. ds_load
8675 store *none* *none* - global - !volatile & !nontemporal
8677 - private 1. buffer/global/flat_store
8679 - !volatile & nontemporal
8681 1. buffer/global/flat_store
8686 1. buffer/global/flat_store
8687 2. s_waitcnt vscnt(0)
8689 - Must happen before
8690 any following volatile
8701 store *none* *none* - local 1. ds_store
8702 **Unordered Atomic**
8703 ------------------------------------------------------------------------------------
8704 load atomic unordered *any* *any* *Same as non-atomic*.
8705 store atomic unordered *any* *any* *Same as non-atomic*.
8706 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
8707 **Monotonic Atomic**
8708 ------------------------------------------------------------------------------------
8709 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
8710 - wavefront - generic
8711 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
8714 - If CU wavefront execution
8717 load atomic monotonic - singlethread - local 1. ds_load
8720 load atomic monotonic - agent - global 1. buffer/global/flat_load
8721 - system - generic glc=1 dlc=1
8722 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
8723 - wavefront - generic
8727 store atomic monotonic - singlethread - local 1. ds_store
8730 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
8731 - wavefront - generic
8735 atomicrmw monotonic - singlethread - local 1. ds_atomic
8739 ------------------------------------------------------------------------------------
8740 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
8743 load atomic acquire - workgroup - global 1. buffer/global_load glc=1
8745 - If CU wavefront execution
8748 2. s_waitcnt vmcnt(0)
8750 - If CU wavefront execution
8752 - Must happen before
8753 the following buffer_gl0_inv
8754 and before any following
8762 - If CU wavefront execution
8769 load atomic acquire - workgroup - local 1. ds_load
8770 2. s_waitcnt lgkmcnt(0)
8773 - Must happen before
8774 the following buffer_gl0_inv
8775 and before any following
8776 global/generic load/load
8782 older than the local load
8788 - If CU wavefront execution
8796 load atomic acquire - workgroup - generic 1. flat_load glc=1
8798 - If CU wavefront execution
8801 2. s_waitcnt lgkmcnt(0) &
8804 - If CU wavefront execution
8805 mode, omit vmcnt(0).
8808 - Must happen before
8810 buffer_gl0_inv and any
8811 following global/generic
8818 older than a local load
8824 - If CU wavefront execution
8831 load atomic acquire - agent - global 1. buffer/global_load
8832 - system glc=1 dlc=1
8833 2. s_waitcnt vmcnt(0)
8835 - Must happen before
8846 - Must happen before
8856 load atomic acquire - agent - generic 1. flat_load glc=1 dlc=1
8857 - system 2. s_waitcnt vmcnt(0) &
8862 - Must happen before
8865 - Ensures the flat_load
8873 - Must happen before
8883 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
8886 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
8887 2. s_waitcnt vm/vscnt(0)
8889 - If CU wavefront execution
8891 - Use vmcnt(0) if atomic with
8892 return and vscnt(0) if
8893 atomic with no-return.
8894 - Must happen before
8895 the following buffer_gl0_inv
8896 and before any following
8904 - If CU wavefront execution
8911 atomicrmw acquire - workgroup - local 1. ds_atomic
8912 2. s_waitcnt lgkmcnt(0)
8915 - Must happen before
8921 older than the local
8933 atomicrmw acquire - workgroup - generic 1. flat_atomic
8934 2. s_waitcnt lgkmcnt(0) &
8937 - If CU wavefront execution
8938 mode, omit vm/vscnt(0).
8939 - If OpenCL, omit lgkmcnt(0).
8940 - Use vmcnt(0) if atomic with
8941 return and vscnt(0) if
8942 atomic with no-return.
8943 - Must happen before
8955 - If CU wavefront execution
8962 atomicrmw acquire - agent - global 1. buffer/global_atomic
8963 - system 2. s_waitcnt vm/vscnt(0)
8965 - Use vmcnt(0) if atomic with
8966 return and vscnt(0) if
8967 atomic with no-return.
8968 - Must happen before
8980 - Must happen before
8990 atomicrmw acquire - agent - generic 1. flat_atomic
8991 - system 2. s_waitcnt vm/vscnt(0) &
8996 - Use vmcnt(0) if atomic with
8997 return and vscnt(0) if
8998 atomic with no-return.
8999 - Must happen before
9011 - Must happen before
9021 fence acquire - singlethread *none* *none*
9023 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
9026 - If CU wavefront execution
9027 mode, omit vmcnt(0) and
9036 vmcnt(0) and vscnt(0).
9037 - However, since LLVM
9052 - Could be split into
9055 vscnt(0) and s_waitcnt
9061 - s_waitcnt vmcnt(0)
9066 atomicrmw-with-return-value
9073 fence-paired-atomic).
9074 - s_waitcnt vscnt(0)
9078 atomicrmw-no-return-value
9085 fence-paired-atomic).
9086 - s_waitcnt lgkmcnt(0)
9097 fence-paired-atomic).
9098 - Must happen before
9112 fence-paired-atomic.
9116 - If CU wavefront execution
9123 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
9124 - system vmcnt(0) & vscnt(0)
9133 vmcnt(0) and vscnt(0).
9134 - However, since LLVM
9142 - Could be split into
9145 vscnt(0) and s_waitcnt
9151 - s_waitcnt vmcnt(0)
9156 atomicrmw-with-return-value
9163 fence-paired-atomic).
9164 - s_waitcnt vscnt(0)
9168 atomicrmw-no-return-value
9175 fence-paired-atomic).
9176 - s_waitcnt lgkmcnt(0)
9187 fence-paired-atomic).
9188 - Must happen before
9202 fence-paired-atomic.
9207 - Must happen before any
9208 following global/generic
9218 ------------------------------------------------------------------------------------
9219 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
9222 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
9223 - generic vmcnt(0) & vscnt(0)
9225 - If CU wavefront execution
9226 mode, omit vmcnt(0) and
9230 - Could be split into
9233 vscnt(0) and s_waitcnt
9239 - s_waitcnt vmcnt(0)
9242 global/generic load/load
9244 atomicrmw-with-return-value.
9245 - s_waitcnt vscnt(0)
9251 atomicrmw-no-return-value.
9252 - s_waitcnt lgkmcnt(0)
9259 - Must happen before
9270 2. buffer/global/flat_store
9271 store atomic release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
9273 - If CU wavefront execution
9276 - Could be split into
9278 vmcnt(0) and s_waitcnt
9284 - s_waitcnt vmcnt(0)
9287 global/generic load/load
9289 atomicrmw-with-return-value.
9290 - s_waitcnt vscnt(0)
9295 atomicrmw-no-return-value.
9296 - Must happen before
9308 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
9309 - system - generic vmcnt(0) & vscnt(0)
9315 - Could be split into
9317 vmcnt(0), s_waitcnt vscnt(0)
9324 - s_waitcnt vmcnt(0)
9330 atomicrmw-with-return-value.
9331 - s_waitcnt vscnt(0)
9336 atomicrmw-no-return-value.
9337 - s_waitcnt lgkmcnt(0)
9344 - Must happen before
9355 2. buffer/global/flat_store
9356 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
9359 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
9360 - generic vmcnt(0) & vscnt(0)
9362 - If CU wavefront execution
9363 mode, omit vmcnt(0) and
9365 - If OpenCL, omit lgkmcnt(0).
9366 - Could be split into
9369 vscnt(0) and s_waitcnt
9375 - s_waitcnt vmcnt(0)
9378 global/generic load/load
9380 atomicrmw-with-return-value.
9381 - s_waitcnt vscnt(0)
9387 atomicrmw-no-return-value.
9388 - s_waitcnt lgkmcnt(0)
9395 - Must happen before
9406 2. buffer/global/flat_atomic
9407 atomicrmw release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
9409 - If CU wavefront execution
9412 - Could be split into
9414 vmcnt(0) and s_waitcnt
9420 - s_waitcnt vmcnt(0)
9423 global/generic load/load
9425 atomicrmw-with-return-value.
9426 - s_waitcnt vscnt(0)
9431 atomicrmw-no-return-value.
9432 - Must happen before
9444 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
9445 - system - generic vmcnt(0) & vscnt(0)
9449 - Could be split into
9452 vscnt(0) and s_waitcnt
9458 - s_waitcnt vmcnt(0)
9463 atomicrmw-with-return-value.
9464 - s_waitcnt vscnt(0)
9469 atomicrmw-no-return-value.
9470 - s_waitcnt lgkmcnt(0)
9477 - Must happen before
9488 2. buffer/global/flat_atomic
9489 fence release - singlethread *none* *none*
9491 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
9494 - If CU wavefront execution
9495 mode, omit vmcnt(0) and
9504 vmcnt(0) and vscnt(0).
9505 - However, since LLVM
9520 - Could be split into
9523 vscnt(0) and s_waitcnt
9529 - s_waitcnt vmcnt(0)
9535 atomicrmw-with-return-value.
9536 - s_waitcnt vscnt(0)
9541 atomicrmw-no-return-value.
9542 - s_waitcnt lgkmcnt(0)
9547 atomic/store atomic/
9549 - Must happen before
9558 fence-paired-atomic).
9565 fence-paired-atomic.
9567 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
9568 - system vmcnt(0) & vscnt(0)
9577 vmcnt(0) and vscnt(0).
9578 - However, since LLVM
9593 - Could be split into
9596 vscnt(0) and s_waitcnt
9602 - s_waitcnt vmcnt(0)
9607 atomicrmw-with-return-value.
9608 - s_waitcnt vscnt(0)
9613 atomicrmw-no-return-value.
9614 - s_waitcnt lgkmcnt(0)
9621 - Must happen before
9630 fence-paired-atomic).
9637 fence-paired-atomic.
9639 **Acquire-Release Atomic**
9640 ------------------------------------------------------------------------------------
9641 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
9644 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) &
9647 - If CU wavefront execution
9648 mode, omit vmcnt(0) and
9658 - Could be split into
9661 vscnt(0), and s_waitcnt
9667 - s_waitcnt vmcnt(0)
9670 global/generic load/load
9672 atomicrmw-with-return-value.
9673 - s_waitcnt vscnt(0)
9679 atomicrmw-no-return-value.
9680 - s_waitcnt lgkmcnt(0)
9687 - Must happen before
9698 2. buffer/global_atomic
9699 3. s_waitcnt vm/vscnt(0)
9701 - If CU wavefront execution
9703 - Use vmcnt(0) if atomic with
9704 return and vscnt(0) if
9705 atomic with no-return.
9706 - Must happen before
9718 - If CU wavefront execution
9725 atomicrmw acq_rel - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
9727 - If CU wavefront execution
9730 - Could be split into
9732 vmcnt(0) and s_waitcnt
9738 - s_waitcnt vmcnt(0)
9741 global/generic load/load
9743 atomicrmw-with-return-value.
9744 - s_waitcnt vscnt(0)
9749 atomicrmw-no-return-value.
9750 - Must happen before
9762 3. s_waitcnt lgkmcnt(0)
9765 - Must happen before
9771 older than the local load
9777 - If CU wavefront execution
9785 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) &
9788 - If CU wavefront execution
9789 mode, omit vmcnt(0) and
9791 - If OpenCL, omit lgkmcnt(0).
9792 - Could be split into
9795 vscnt(0) and s_waitcnt
9801 - s_waitcnt vmcnt(0)
9804 global/generic load/load
9806 atomicrmw-with-return-value.
9807 - s_waitcnt vscnt(0)
9813 atomicrmw-no-return-value.
9814 - s_waitcnt lgkmcnt(0)
9821 - Must happen before
9833 3. s_waitcnt lgkmcnt(0) &
9836 - If CU wavefront execution
9837 mode, omit vmcnt(0) and
9839 - If OpenCL, omit lgkmcnt(0).
9840 - Must happen before
9852 - If CU wavefront execution
9859 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
9860 - system vmcnt(0) & vscnt(0)
9864 - Could be split into
9867 vscnt(0) and s_waitcnt
9873 - s_waitcnt vmcnt(0)
9878 atomicrmw-with-return-value.
9879 - s_waitcnt vscnt(0)
9884 atomicrmw-no-return-value.
9885 - s_waitcnt lgkmcnt(0)
9892 - Must happen before
9903 2. buffer/global_atomic
9904 3. s_waitcnt vm/vscnt(0)
9906 - Use vmcnt(0) if atomic with
9907 return and vscnt(0) if
9908 atomic with no-return.
9909 - Must happen before
9921 - Must happen before
9931 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
9932 - system vmcnt(0) & vscnt(0)
9936 - Could be split into
9939 vscnt(0), and s_waitcnt
9945 - s_waitcnt vmcnt(0)
9950 atomicrmw-with-return-value.
9951 - s_waitcnt vscnt(0)
9956 atomicrmw-no-return-value.
9957 - s_waitcnt lgkmcnt(0)
9964 - Must happen before
9976 3. s_waitcnt vm/vscnt(0) &
9981 - Use vmcnt(0) if atomic with
9982 return and vscnt(0) if
9983 atomic with no-return.
9984 - Must happen before
9996 - Must happen before
10006 fence acq_rel - singlethread *none* *none*
10008 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
10009 vmcnt(0) & vscnt(0)
10011 - If CU wavefront execution
10012 mode, omit vmcnt(0) and
10021 vmcnt(0) and vscnt(0).
10031 - Could be split into
10033 vmcnt(0), s_waitcnt
10034 vscnt(0) and s_waitcnt
10035 lgkmcnt(0) to allow
10037 independently moved
10040 - s_waitcnt vmcnt(0)
10046 atomicrmw-with-return-value.
10047 - s_waitcnt vscnt(0)
10051 store/store atomic/
10052 atomicrmw-no-return-value.
10053 - s_waitcnt lgkmcnt(0)
10058 atomic/store atomic/
10060 - Must happen before
10079 and memory ordering
10083 acquire-fence-paired-atomic)
10096 local/generic store
10100 and memory ordering
10104 release-fence-paired-atomic).
10108 - Must happen before
10112 acquire-fence-paired
10113 atomic has completed
10114 before invalidating
10118 locations read must
10122 acquire-fence-paired-atomic.
10126 - If CU wavefront execution
10133 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
10134 - system vmcnt(0) & vscnt(0)
10143 vmcnt(0) and vscnt(0).
10144 - However, since LLVM
10152 - Could be split into
10154 vmcnt(0), s_waitcnt
10155 vscnt(0) and s_waitcnt
10156 lgkmcnt(0) to allow
10158 independently moved
10161 - s_waitcnt vmcnt(0)
10167 atomicrmw-with-return-value.
10168 - s_waitcnt vscnt(0)
10172 store/store atomic/
10173 atomicrmw-no-return-value.
10174 - s_waitcnt lgkmcnt(0)
10181 - Must happen before
10186 global/local/generic
10191 and memory ordering
10195 acquire-fence-paired-atomic)
10197 before invalidating
10207 global/local/generic
10212 and memory ordering
10216 release-fence-paired-atomic).
10224 - Must happen before
10238 **Sequentially Consistent Atomic**
10239 ------------------------------------------------------------------------------------
10240 load atomic seq_cst - singlethread - global *Same as corresponding
10241 - wavefront - local load atomic acquire,
10242 - generic except must generate
10243 all instructions even
10245 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) &
10246 - generic vmcnt(0) & vscnt(0)
10248 - If CU wavefront execution
10249 mode, omit vmcnt(0) and
10251 - Could be split into
10253 vmcnt(0), s_waitcnt
10254 vscnt(0), and s_waitcnt
10255 lgkmcnt(0) to allow
10257 independently moved
10260 - s_waitcnt lgkmcnt(0) must
10267 ordering of seq_cst
10273 lgkmcnt(0) and so do
10276 - s_waitcnt vmcnt(0)
10279 global/generic load
10281 atomicrmw-with-return-value
10283 ordering of seq_cst
10292 - s_waitcnt vscnt(0)
10295 global/generic store
10297 atomicrmw-no-return-value
10299 ordering of seq_cst
10311 consistent global/local
10312 memory instructions
10318 prevents reordering
10321 seq_cst load. (Note
10327 followed by a store
10334 release followed by
10337 order. The s_waitcnt
10338 could be placed after
10339 seq_store or before
10342 make the s_waitcnt be
10343 as late as possible
10349 instructions same as
10352 except must generate
10353 all instructions even
10355 load atomic seq_cst - workgroup - local
10357 1. s_waitcnt vmcnt(0) & vscnt(0)
10359 - If CU wavefront execution
10361 - Could be split into
10363 vmcnt(0) and s_waitcnt
10366 independently moved
10369 - s_waitcnt vmcnt(0)
10372 global/generic load
10374 atomicrmw-with-return-value
10376 ordering of seq_cst
10385 - s_waitcnt vscnt(0)
10388 global/generic store
10390 atomicrmw-no-return-value
10392 ordering of seq_cst
10405 memory instructions
10411 prevents reordering
10414 seq_cst load. (Note
10420 followed by a store
10427 release followed by
10430 order. The s_waitcnt
10431 could be placed after
10432 seq_store or before
10435 make the s_waitcnt be
10436 as late as possible
10442 instructions same as
10445 except must generate
10446 all instructions even
10449 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
10450 - system - generic vmcnt(0) & vscnt(0)
10452 - Could be split into
10454 vmcnt(0), s_waitcnt
10455 vscnt(0) and s_waitcnt
10456 lgkmcnt(0) to allow
10458 independently moved
10461 - s_waitcnt lgkmcnt(0)
10468 ordering of seq_cst
10474 lgkmcnt(0) and so do
10477 - s_waitcnt vmcnt(0)
10480 global/generic load
10482 atomicrmw-with-return-value
10484 ordering of seq_cst
10493 - s_waitcnt vscnt(0)
10496 global/generic store
10498 atomicrmw-no-return-value
10500 ordering of seq_cst
10513 memory instructions
10519 prevents reordering
10522 seq_cst load. (Note
10528 followed by a store
10535 release followed by
10538 order. The s_waitcnt
10539 could be placed after
10540 seq_store or before
10543 make the s_waitcnt be
10544 as late as possible
10550 instructions same as
10553 except must generate
10554 all instructions even
10556 store atomic seq_cst - singlethread - global *Same as corresponding
10557 - wavefront - local store atomic release,
10558 - workgroup - generic except must generate
10559 - agent all instructions even
10560 - system for OpenCL.*
10561 atomicrmw seq_cst - singlethread - global *Same as corresponding
10562 - wavefront - local atomicrmw acq_rel,
10563 - workgroup - generic except must generate
10564 - agent all instructions even
10565 - system for OpenCL.*
10566 fence seq_cst - singlethread *none* *Same as corresponding
10567 - wavefront fence acq_rel,
10568 - workgroup except must generate
10569 - agent all instructions even
10570 - system for OpenCL.*
10571 ============ ============ ============== ========== ================================
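As an illustration of how the per-row notes in the table above combine, the following minimal sketch restates the notes for the ``atomicrmw acquire - workgroup - generic`` row as a small decision function. The function name and its boolean parameters are hypothetical and are not part of the backend; the sketch only lists which counters the ``s_waitcnt`` after the ``flat_atomic`` would wait on under the stated conditions.

```python
def acquire_waitcnt_counters(returns_value, cu_mode, opencl):
    """Counters to wait on after flat_atomic for the
    'atomicrmw acquire - workgroup - generic' row.

    Hypothetical helper: it only restates the notes of that table row.
    """
    counters = []
    # "If OpenCL, omit lgkmcnt(0)."
    if not opencl:
        counters.append("lgkmcnt(0)")
    # "If CU wavefront execution mode, omit vm/vscnt(0)."
    if not cu_mode:
        # "Use vmcnt(0) if atomic with return and vscnt(0) if
        # atomic with no-return."
        counters.append("vmcnt(0)" if returns_value else "vscnt(0)")
    return counters


if __name__ == "__main__":
    for args in [(True, False, False), (False, False, False), (True, True, True)]:
        print(args, acquire_waitcnt_counters(*args))
```

An empty result means that, under those assumptions, no wait is required by the notes of this row; other rows of the table carry different notes and would need their own rules.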
10576 For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
10577 runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
10578 supports the ``s_trap`` instruction. For usage see:
10580 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
10581 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
10582 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-table`
10584 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
10585 :name: amdgpu-trap-handler-for-amdhsa-os-v2-table
10587 =================== =============== =============== =======================================
10588 Usage Code Sequence Trap Handler Description
10590 =================== =============== =============== =======================================
10591 reserved ``s_trap 0x00`` Reserved by hardware.
10592 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for Finalizer HSA ``debugtrap``
10593 ``queue_ptr`` intrinsic (not implemented).
10596 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes wave to be halted with the PC at
10597 ``queue_ptr`` the trap instruction. The associated
10598 queue is signalled to put it into the
10599 error state. When the queue is put in
10600 the error state, the waves executing
10601 dispatches on the queue will be
10603 ``llvm.debugtrap`` ``s_trap 0x03`` *none* - If debugger not enabled then behaves
10604 as a no-operation. The trap handler
10605 is entered and immediately returns to
10606 continue execution of the wavefront.
10607 - If the debugger is enabled, causes
10608 the debug trap to be reported by the
10609 debugger and the wavefront is put in
10610 the halt state with the PC at the
10611 instruction. The debugger must
10612 increment the PC and resume the wave.
10613 reserved ``s_trap 0x04`` Reserved.
10614 reserved ``s_trap 0x05`` Reserved.
10615 reserved ``s_trap 0x06`` Reserved.
10616 reserved ``s_trap 0x07`` Reserved.
10617 reserved ``s_trap 0x08`` Reserved.
10618 reserved ``s_trap 0xfe`` Reserved.
10619 reserved ``s_trap 0xff`` Reserved.
10620 =================== =============== =============== =======================================
10624 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
10625 :name: amdgpu-trap-handler-for-amdhsa-os-v3-table
10627 =================== =============== =============== =======================================
10628 Usage Code Sequence Trap Handler Description
10630 =================== =============== =============== =======================================
10631 reserved ``s_trap 0x00`` Reserved by hardware.
10632 debugger breakpoint ``s_trap 0x01`` *none* Reserved for debugger to use for
10633 breakpoints. Causes wave to be halted
10634 with the PC at the trap instruction.
10635 The debugger is responsible to resume
10636 the wave, including the instruction
10637 that the breakpoint overwrote.
10638 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes wave to be halted with the PC at
10639 ``queue_ptr`` the trap instruction. The associated
10640 queue is signalled to put it into the
10641 error state. When the queue is put in
10642 the error state, the waves executing
10643 dispatches on the queue will be
10645 ``llvm.debugtrap`` ``s_trap 0x03`` *none* - If debugger not enabled then behaves
10646 as a no-operation. The trap handler
10647 is entered and immediately returns to
10648 continue execution of the wavefront.
10649 - If the debugger is enabled, causes
10650 the debug trap to be reported by the
10651 debugger and the wavefront is put in
10652 the halt state with the PC at the
10653 instruction. The debugger must
10654 increment the PC and resume the wave.
10655 reserved ``s_trap 0x04`` Reserved.
10656 reserved ``s_trap 0x05`` Reserved.
10657 reserved ``s_trap 0x06`` Reserved.
10658 reserved ``s_trap 0x07`` Reserved.
10659 reserved ``s_trap 0x08`` Reserved.
10660 reserved ``s_trap 0xfe`` Reserved.
10661 reserved ``s_trap 0xff`` Reserved.
10662 =================== =============== =============== =======================================
10666 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4
10667 :name: amdgpu-trap-handler-for-amdhsa-os-v4-table
10669 =================== =============== ================ ================= =======================================
10670 Usage Code Sequence GFX6-GFX8 Inputs GFX9-GFX10 Inputs Description
10671 =================== =============== ================ ================= =======================================
10672 reserved ``s_trap 0x00`` Reserved by hardware.
10673 debugger breakpoint ``s_trap 0x01`` *none* *none* Reserved for debugger to use for
10674 breakpoints. Causes wave to be halted
10675 with the PC at the trap instruction.
10676 The debugger is responsible to resume
10677 the wave, including the instruction
10678 that the breakpoint overwrote.
10679 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: *none* Causes wave to be halted with the PC at
10680 ``queue_ptr`` the trap instruction. The associated
10681 queue is signalled to put it into the
10682 error state. When the queue is put in
10683 the error state, the waves executing
10684 dispatches on the queue will be
10686 ``llvm.debugtrap`` ``s_trap 0x03`` *none* *none* - If debugger not enabled then behaves
10687 as a no-operation. The trap handler
10688 is entered and immediately returns to
10689 continue execution of the wavefront.
10690 - If the debugger is enabled, causes
10691 the debug trap to be reported by the
10692 debugger and the wavefront is put in
10693 the halt state with the PC at the
10694 instruction. The debugger must
10695 increment the PC and resume the wave.
10696 reserved ``s_trap 0x04`` Reserved.
10697 reserved ``s_trap 0x05`` Reserved.
10698 reserved ``s_trap 0x06`` Reserved.
10699 reserved ``s_trap 0x07`` Reserved.
10700 reserved ``s_trap 0x08`` Reserved.
10701 reserved ``s_trap 0xfe`` Reserved.
10702 reserved ``s_trap 0xff`` Reserved.
10703 =================== =============== ================ ================= =======================================
10705 .. _amdgpu-amdhsa-function-call-convention:
10712 This section is currently incomplete and has inaccuracies. It is a work in progress
10713 that will be updated as information is determined.
10715 See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
10716 addresses. Unswizzled addresses are normal linear addresses.
10718 .. _amdgpu-amdhsa-function-call-convention-kernel-functions:
10723 This section describes the call convention ABI for the outer kernel function.
10725 See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
10728 The following is not part of the AMDGPU kernel calling convention but describes
10729 how the AMDGPU backend implements function calls:
10731 1. Clang decides the kernarg layout to match the *HSA Programmer's Language
10734 - All structs are passed directly.
10735 - Lambda values are passed *TBA*.
10739 - Does this really follow HSA rules? Or are structs >16 bytes passed
10741 - What is ABI for lambda values?
10743 4. The kernel performs certain setup in its prolog, as described in
10744 :ref:`amdgpu-amdhsa-kernel-prolog`.
10746 .. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
10748 Non-Kernel Functions
10749 ++++++++++++++++++++
10751 This section describes the call convention ABI for functions other than the
10752 outer kernel function.
10754 If a kernel has function calls then scratch is always allocated and used for
10755 the call stack which grows from low address to high address using the swizzled
10756 scratch address space.
10758 On entry to a function:
10760 1. SGPR0-3 contain a V# with the following properties (see
10761 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
10763 * Base address pointing to the beginning of the wavefront scratch backing
10765 * Swizzled with dword element size and stride of wavefront size elements.
10767 2. The FLAT_SCRATCH register pair is set up. See
10768 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
10769 3. GFX6-GFX8: M0 register set to the size of LDS in bytes. See
10770 :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
10771 4. The EXEC register is set to the lanes active on entry to the function.
10772 5. MODE register: *TBD*
10773 6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
10775 7. SGPR30-31 contain the return address (RA): the code address that the function must
10776 return to when it completes. The value is undefined if the function is *no
10778 8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
10779 offset relative to the beginning of the wavefront scratch backing memory.
10781 The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
10782 offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
10785 The unswizzled SP value can be converted into the swizzled SP value by:
10787 | swizzled SP = unswizzled SP / wavefront size
10789 This may be used to obtain the private address space address of stack
10790 objects and to convert this address to a flat address by adding the flat
10791 scratch aperture base address.
10793 The swizzled SP value is always 4 byte aligned for the ``r600``
10794 architecture and 16 byte aligned for the ``amdgcn`` architecture.
10798 The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
10799 OpenCL language which has the largest base type defined as 16 bytes.
10801 On entry, the swizzled SP value is the address of the first function
10802 argument passed on the stack. Other stack passed arguments are positive
10803 offsets from the entry swizzled SP value.
10805 The function may use positive offsets beyond the last stack passed argument
10806 for stack allocated local variables and register spill slots. If necessary,
10807 the function may align these to greater alignment than 16 bytes. After these
10808 the function may dynamically allocate space for such things as runtime sized
10809 ``alloca`` local allocations.
10811 If the function calls another function, it will place any stack allocated
10812 arguments after the last local allocation and adjust SGPR32 to the address
10813 after the last local allocation.
10815 9. All other registers are unspecified.
10816 10. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
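The stack pointer conversion described in item 8 above can be sketched as follows. This is a minimal illustration, not backend code: the function names, the wavefront size of 64, and the aperture base value are assumptions chosen for the example.

```python
def swizzled_sp(unswizzled_sp, wavefront_size):
    """swizzled SP = unswizzled SP / wavefront size (integer division)."""
    return unswizzled_sp // wavefront_size


def flat_address(swizzled_address, flat_scratch_aperture_base):
    """Convert a private (swizzled) address to a flat address by adding
    the flat scratch aperture base address."""
    return flat_scratch_aperture_base + swizzled_address


def is_amdgcn_aligned(swizzled_address):
    """The text states 16 byte alignment of the swizzled SP for amdgcn."""
    return swizzled_address % 16 == 0


if __name__ == "__main__":
    WAVEFRONT_SIZE = 64     # assumed for illustration
    APERTURE_BASE = 0x1000  # hypothetical aperture base address
    sp = swizzled_sp(4096, WAVEFRONT_SIZE)
    print(sp, is_amdgcn_aligned(sp), hex(flat_address(sp, APERTURE_BASE)))
```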
10819 On exit from a function:
10821 1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
10822 described below. Any registers used are considered clobbered registers.
10823 2. The following registers are preserved and have the same value as on entry:
10828 * All SGPR registers except the clobbered registers of SGPR4-31.
10846 Except for the argument registers, the clobbered and preserved VGPRs
10847 are intermixed at regular intervals in order to keep a
10848 similar ratio independent of the number of allocated VGPRs.
10850 * GFX90A: All AGPR registers except the clobbered registers AGPR0-31.
10851 * Lanes of all VGPRs that are inactive at the call site.
10853 For the AMDGPU backend, an inter-procedural register allocation (IPRA)
10854 optimization may mark some of the clobbered SGPR and VGPR registers as
10855 preserved if it can be determined that the called function does not change
10858 3. The PC is set to the RA provided on entry.
10859 4. MODE register: *TBD*.
10860 5. All other registers are clobbered.
10861 6. Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
10862 the function is available to the caller.
10866 - How are function results returned? The address of structured types is passed
10867 by reference, but what about other types?
10869 The function input arguments are made up of the formal arguments explicitly
10870 declared by the source language function plus the implicit input arguments used
10871 by the implementation.
10873 The source language input arguments are:
10875 1. Any source language implicit ``this`` or ``self`` argument comes first as a
10877 2. Followed by the function formal arguments in left to right source order.
10879 The source language result arguments are:
10881 1. The function result argument.
10883 The source language input or result struct type arguments that are less than or
10884 equal to 16 bytes are decomposed recursively into their base type fields, and
10885 each field is passed as if a separate argument. For input arguments, if the
10886 called function requires the struct to be in memory, for example because its
10887 address is taken, then the function body is responsible for allocating a stack
10888 location and copying the field arguments into it. Clang terms this *direct
10891 The source language input struct type arguments that are greater than 16 bytes
10892 are passed by reference. The caller is responsible for allocating a stack
10893 location to hold a copy of the struct value and for passing the address as the input
10894 argument. The called function is responsible for performing the dereference when
10895 accessing the input argument. Clang terms this *by-value struct*.
10897 A source language result struct type argument that is greater than 16 bytes is
10898 returned by reference. The caller is responsible for allocating a stack location
10899 to hold the result value and passes the address as the last input argument
10900 (before the implicit input arguments). In this case there are no result
10901 arguments. The called function is responsible for performing the dereference when
10902 storing the result value. Clang terms this *structured return (sret)*.
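The size-based rules in the preceding paragraphs can be summarized by the sketch below. It is only a restatement of the text as written (which this section itself flags as possibly inaccurate); the function name and the returned labels are hypothetical and do not correspond to Clang identifiers.

```python
DECOMPOSE_LIMIT_BYTES = 16  # threshold stated in the text


def classify_struct_argument(size_bytes, is_result=False):
    """Return how a struct argument is passed according to the text.

    Hypothetical helper; labels are descriptive, not Clang terminology.
    """
    if size_bytes <= DECOMPOSE_LIMIT_BYTES:
        # Decomposed recursively into base type fields, each passed
        # as if it were a separate argument.
        return "decomposed"
    if is_result:
        # Caller allocates the result location and passes its address
        # as the last input argument.
        return "returned-by-reference"
    # Caller makes a stack copy and passes its address.
    return "passed-by-reference"


if __name__ == "__main__":
    for size, is_result in [(8, False), (16, True), (24, False), (24, True)]:
        print(size, is_result, classify_struct_argument(size, is_result))
```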
10904 *TODO: correct the ``sret`` definition.*
10908 Is this definition correct? Or is ``sret`` only used if passing in registers, and
10909 pass as non-decomposed struct as stack argument? Or something else? Is the
10910 memory location in the caller stack frame, or a stack memory argument and so
10911 no address is passed as the caller can directly write to the argument stack
10912 location? But then the stack location is still live after return. If it is an
10913 argument stack location, is it the first stack argument or the last one?
10915 Lambda argument types are treated as struct types with an implementation defined
10920 Need to specify the ABI for lambda types for AMDGPU.
10922 For AMDGPU backend all source language arguments (including the decomposed
10923 struct type arguments) are passed in VGPRs unless marked ``inreg`` in which case
10924 they are passed in SGPRs.
10926 The AMDGPU backend walks the function call graph from the leaves to determine
10927 which implicit input arguments are used, propagating to each caller of the
10928 function. The used implicit arguments are appended to the function arguments
10929 after the source language arguments in the following order:
10933 Are recursion or external functions supported?
10935 1. Work-Item ID (1 VGPR)
10937 The X, Y and Z work-item IDs are packed into a single VGPR with the following
10938 layout. Only fields actually used by the function are set. The other bits
10941 The values come from the initial kernel execution state. See
10942 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
10944 .. table:: Work-item implicit argument layout
10945 :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table
10947 ======= ======= ==============
10948 Bits Size Field Name
10949 ======= ======= ==============
10950 9:0 10 bits X Work-Item ID
10951 19:10 10 bits Y Work-Item ID
10952 29:20 10 bits Z Work-Item ID
10953 31:30 2 bits Unused
10954 ======= ======= ==============
10956 2. Dispatch Ptr (2 SGPRs)
10958 The value comes from the initial kernel execution state. See
10959 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10961 3. Queue Ptr (2 SGPRs)
10963 The value comes from the initial kernel execution state. See
10964 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10966 4. Kernarg Segment Ptr (2 SGPRs)
10968 The value comes from the initial kernel execution state. See
10969 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10971 5. Dispatch ID (2 SGPRs)
10973 The value comes from the initial kernel execution state. See
10974 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10976 6. Work-Group ID X (1 SGPR)
10978 The value comes from the initial kernel execution state. See
10979 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10981 7. Work-Group ID Y (1 SGPR)
10983 The value comes from the initial kernel execution state. See
10984 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10986 8. Work-Group ID Z (1 SGPR)
10988 The value comes from the initial kernel execution state. See
10989 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10991 9. Implicit Argument Ptr (2 SGPRs)
10993 The value is computed by adding an offset to Kernarg Segment Ptr to get the
10994 global address space pointer to the first kernarg implicit argument.
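The packed Work-Item ID layout given in the table under item 1 above (X in bits 9:0, Y in bits 19:10, Z in bits 29:20) can be illustrated with the following sketch. The function names are hypothetical; only the bit positions come from the table.

```python
FIELD_MASK = 0x3FF  # each work-item ID field is 10 bits wide


def pack_work_item_id(x, y, z):
    """Pack X, Y and Z work-item IDs into one 32-bit value.

    Bits 31:30 are unused and left as zero here.
    """
    return (x & FIELD_MASK) | ((y & FIELD_MASK) << 10) | ((z & FIELD_MASK) << 20)


def unpack_work_item_id(packed):
    """Extract the (X, Y, Z) work-item IDs from the packed value."""
    x = packed & FIELD_MASK
    y = (packed >> 10) & FIELD_MASK
    z = (packed >> 20) & FIELD_MASK
    return x, y, z


if __name__ == "__main__":
    packed = pack_work_item_id(5, 7, 9)
    print(hex(packed), unpack_work_item_id(packed))
```

Note that the text says only the fields actually used by the function are set, so a consumer should not rely on the contents of fields it did not request.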
10996 The input and result arguments are assigned in order in the following manner:
11000 There are likely some errors and omissions in the following description that
11005 Check the Clang source code to decipher how function arguments and return
11006 results are handled. Also see the AMDGPU specific values used.
11008 * VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
11011 If there are more arguments than will fit in these registers, the remaining
11012 arguments are allocated on the stack in order on naturally aligned
11017 How are overly aligned structures allocated on the stack?
11019 * SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
11022 If there are more arguments than will fit in these registers, the remaining
11023 arguments are allocated on the stack in order on naturally aligned
11026 Note that decomposed struct type arguments may have some fields passed in
11027 registers and some in memory.
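A simplified sketch of the register-then-stack assignment described above follows. It assumes that every argument (or decomposed field) occupies exactly one 32-bit VGPR or one 4 byte stack slot, that VGPR0-31 are the argument registers as listed in the function entry state, and that stack offsets are positive offsets from the entry swizzled SP. The function name is hypothetical, and alignment of larger types is deliberately not modeled.

```python
def assign_vgpr_arguments(num_args, num_arg_vgprs=32, slot_bytes=4):
    """Assign one-dword arguments to VGPRs first, then to stack slots.

    Returns a list of ("vgpr", register_index) or ("stack", byte_offset)
    entries, one per argument, in argument order.
    """
    locations = []
    for index in range(num_args):
        if index < num_arg_vgprs:
            # Consecutive VGPRs starting at VGPR0.
            locations.append(("vgpr", index))
        else:
            # Remaining arguments go on the stack, in order.
            offset = (index - num_arg_vgprs) * slot_bytes
            locations.append(("stack", offset))
    return locations


if __name__ == "__main__":
    for location in assign_vgpr_arguments(34)[30:]:
        print(location)
```

Under these assumptions, a decomposed struct whose fields straddle the last argument register ends up with some fields in registers and the rest in memory, which matches the note above.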
11031 So, a struct which can pass some fields as decomposed register arguments, will
11032 pass the rest as decomposed stack elements? But an argument that will not start
11033 in registers will not be decomposed and will be passed as a non-decomposed
11036 The following is not part of the AMDGPU function calling convention but
11037 describes how the AMDGPU backend implements function calls:
11039 1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP, it is an
11040 unswizzled scratch address. It is only needed if runtime sized ``alloca``
11041 allocations are used, or for the reasons defined in ``SIFrameLowering``.
11042 2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
11043 to access the incoming stack arguments in the function. The BP is needed
11044 only when the function requires the runtime stack alignment.
11046 3. Allocating SGPR arguments on the stack is not supported.
11048 4. No CFI is currently generated. See
11049 :ref:`amdgpu-dwarf-call-frame-information`.
11053 CFI will be generated that defines the CFA as the unswizzled address
11054 relative to the wave scratch base in the unswizzled private address space
11055 of the lowest address stack allocated local variable.
11057 ``DW_AT_frame_base`` will be defined as the swizzled address in the
11058 swizzled private address space by dividing the CFA by the wavefront size
11059 (since CFA is always at least dword aligned which matches the scratch
11060 swizzle element size).
11062 If no dynamic stack alignment was performed, the stack allocated arguments
11063 are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
11064 local variables and register spill slots are accessed as positive offsets
11065 relative to ``DW_AT_frame_base``.
11067 5. Function argument passing is implemented by copying the input physical
11068 registers to virtual registers on entry. The register allocator can spill if
11069 necessary. These are copied back to physical registers at call sites. The
11070 net effect is that each function call can have these values in entirely
11071 distinct locations. The IPRA can help avoid shuffling argument registers.
11072 6. Call sites are implemented by setting up the arguments at positive offsets
11073 from SP. Then SP is incremented to account for the known frame size before
11074 the call and decremented after the call.
11078 The CFI will reflect the changed calculation needed to compute the CFA
11081 7. 4 byte spill slots are used in the stack frame. One slot is allocated for an
11082 emergency spill slot. Buffer instructions are used for stack accesses and
11083 not the ``flat_scratch`` instruction.
11087 Explain when the emergency spill slot is used.
11091 Possible broken issues:
11093 - Stack arguments must be aligned to required alignment.
11094 - Stack is aligned to max(16, max formal argument alignment)
11095 - Direct argument < 64 bits should check register budget.
11096 - Register budget calculation should respect ``inreg`` for SGPR.
11097 - SGPR overflow is not handled.
11098 - struct with 1 member unpeeling is not checking size of member.
11099 - ``sret`` is after ``this`` pointer.
11100 - Caller is not implementing stack realignment: need an extra pointer.
11101 - Should say AMDGPU passes FP rather than SP.
11102 - Should CFI define CFA as address of locals or arguments. Difference is
11103 apparent when have implemented dynamic alignment.
11104 - If ``SCRATCH`` instruction could allow negative offsets, then can make FP be
11105 highest address of stack frame and use negative offset for locals. Would
11106 allow SP to be the same as FP and could support signal-handler-like as now
11107 have a real SP for the top of the stack.
11108 - How is ``sret`` passed on the stack? In argument stack area? Can it overlay
11114 This section provides code conventions used when the target triple OS is
11115 ``amdpal`` (see :ref:`amdgpu-target-triples`).
11117 .. _amdgpu-amdpal-code-object-metadata-section:
11119 Code Object Metadata
11120 ~~~~~~~~~~~~~~~~~~~~
11124 The metadata is currently in development and is subject to major
11125 changes. Only the current version is supported. *When this document
11126 was generated the version was 2.6.*
11128 Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
11129 record (see :ref:`amdgpu-note-records-v3-v4`).
11131 The metadata is represented as Message Pack formatted binary data (see
11132 [MsgPack]_). The top level is a Message Pack map that includes the keys
11133 defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
11134 and referenced tables.
11136 Additional information can be added to the maps. To avoid conflicts, any
11137 key names should be prefixed by "*vendor-name*." where ``vendor-name``
11138 can be the name of the vendor and specific vendor tool that generates the
11139 information. The prefix is abbreviated to simply "." when it appears
11140 within a map that has been added by the same *vendor-name*.
11142 .. table:: AMDPAL Code Object Metadata Map
11143 :name: amdgpu-amdpal-code-object-metadata-map-table
11145 =================== ============== ========= ======================================================================
11146 String Key Value Type Required? Description
11147 =================== ============== ========= ======================================================================
11148 "amdpal.version" sequence of Required PAL code object metadata (major, minor) version. The current values
11149 2 integers are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
11150 "amdpal.pipelines" sequence of Required Per-pipeline metadata. See
11151 map :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
11152 definition of the keys included in that map.
11153 =================== ============== ========= ======================================================================
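
The top-level map can be sanity-checked once the Message Pack data has been
decoded. The following is a minimal sketch, assuming the decoded map is
available as a plain Python ``dict``. The function name ``check_top_level`` and
the example values are hypothetical; only the two key names and their value
types come from the table above.

.. code-block:: python

   def check_top_level(metadata):
       """Return a list of problems found in a decoded top-level metadata map."""
       problems = []

       # "amdpal.version" is a required sequence of 2 integers (major, minor).
       version = metadata.get("amdpal.version")
       if version is None:
           problems.append("missing amdpal.version")
       elif len(version) != 2 or not all(isinstance(v, int) for v in version):
           problems.append("amdpal.version must be (major, minor) integers")

       # "amdpal.pipelines" is a required sequence of per-pipeline maps.
       pipelines = metadata.get("amdpal.pipelines")
       if pipelines is None:
           problems.append("missing amdpal.pipelines")
       elif not all(isinstance(p, dict) for p in pipelines):
           problems.append("amdpal.pipelines must be a sequence of maps")

       return problems


   # Hypothetical decoded metadata; the pipeline map contents are placeholders.
   example = {"amdpal.version": [2, 6], "amdpal.pipelines": [{".name": "example"}]}
   print(check_top_level(example))

The sketch checks only the top-level keys. The keys inside each pipeline map
would need their own checks.
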
11157 .. table:: AMDPAL Code Object Pipeline Metadata Map
11158 :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table
11160 ====================================== ============== ========= ===================================================
11161 String Key Value Type Required? Description
11162 ====================================== ============== ========= ===================================================
11163 ".name" string Source name of the pipeline.
11164 ".type" string Pipeline type, e.g. VsPs. Values include:
11174 ".internal_pipeline_hash" sequence of Required Internal compiler hash for this pipeline. Lower
11175 2 integers 64 bits is the "stable" portion of the hash, used
11176 for e.g. shader replacement lookup. Upper 64 bits
11177 is the "unique" portion of the hash, used for
11178 e.g. pipeline cache lookup. The value is
11179 implementation defined, and cannot be relied on
11180 between different builds of the compiler.
11181 ".shaders" map Per-API shader metadata. See
11182 :ref:`amdgpu-amdpal-code-object-shader-map-table`
11183 for the definition of the keys included in that
11185 ".hardware_stages" map Per-hardware stage metadata. See
11186 :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
11187 for the definition of the keys included in that
11189 ".shader_functions" map Per-shader function metadata. See
11190 :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
11191 for the definition of the keys included in that
11193 ".registers" map Required Hardware register configuration. See
11194 :ref:`amdgpu-amdpal-code-object-register-map-table`
11195 for the definition of the keys included in that
11197 ".user_data_limit" integer Number of user data entries accessed by this
11199 ".spill_threshold" integer The user data spill threshold. 0xFFFF for
11200 NoUserDataSpilling.
11201 ".uses_viewport_array_index" boolean Indicates whether or not the pipeline uses the
11202 viewport array index feature. Pipelines which use
11203 this feature can render into all 16 viewports,
11204 whereas pipelines which do not use it are
11205 restricted to viewport #0.
11206 ".es_gs_lds_size" integer Size in bytes of LDS space used internally for
11207 handling data-passing between the ES and GS
11208 shader stages. This can be zero if the data is
11209 passed using off-chip buffers. This value should
11210 be used to program all user-SGPRs which have been
11211 marked with "UserDataMapping::EsGsLdsSize"
11212 (typically only the GS and VS HW stages will ever
11213 have a user-SGPR so marked).
11214 ".nggSubgroupSize" integer Explicit maximum subgroup size for NGG shaders
11215 (maximum number of threads in a subgroup).
11216 ".num_interpolants" integer Graphics only. Number of PS interpolants.
11217 ".mesh_scratch_memory_size" integer Max mesh shader scratch memory used.
11218 ".api" string Name of the client graphics API.
11219 ".api_create_info" binary Graphics API shader create info binary blob. Can
11220 be defined by the driver using the compiler if
11221 they want to be able to correlate API-specific
11222 information used during creation at a later time.
11223 ====================================== ============== ========= ===================================================
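
The ``.internal_pipeline_hash`` entry stores a 128-bit hash as two 64-bit
integers. The sketch below shows one way to split and rejoin such a value. It
assumes the lower ("stable") 64 bits come first in the sequence, which the
table does not state explicitly, and the function names are hypothetical.

.. code-block:: python

   MASK64 = (1 << 64) - 1


   def split_pipeline_hash(hash128):
       """Split a 128-bit hash into [stable, unique] 64-bit halves.

       Assumption: element 0 holds the lower ("stable") 64 bits and element 1
       holds the upper ("unique") 64 bits.
       """
       return [hash128 & MASK64, (hash128 >> 64) & MASK64]


   def join_pipeline_hash(parts):
       """Rebuild the 128-bit value from a [stable, unique] pair."""
       stable, unique = parts
       return (unique << 64) | stable

A tool that only needs the shader replacement lookup would compare the stable
half, while a pipeline cache would compare both halves.
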
11227 .. table:: AMDPAL Code Object Shader Map
11228 :name: amdgpu-amdpal-code-object-shader-map-table
11231 +-------------+--------------+-------------------------------------------------------------------+
11232 |String Key |Value Type |Description |
11233 +=============+==============+===================================================================+
11234 |- ".compute" |map |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
11235 |- ".vertex" | |for the definition of the keys included in that map. |
11238 |- ".geometry"| | |
11240 +-------------+--------------+-------------------------------------------------------------------+
11244 .. table:: AMDPAL Code Object API Shader Metadata Map
11245 :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table
11247 ==================== ============== ========= =====================================================================
11248 String Key Value Type Required? Description
11249 ==================== ============== ========= =====================================================================
11250 ".api_shader_hash" sequence of Required Input shader hash, typically passed in from the client. The value
11251 2 integers is implementation defined, and cannot be relied on between
11252 different builds of the compiler.
11253 ".hardware_mapping" sequence of Required Flags indicating the HW stages this API shader maps to. Values
11264 ==================== ============== ========= =====================================================================
11268 .. table:: AMDPAL Code Object Hardware Stage Map
11269 :name: amdgpu-amdpal-code-object-hardware-stage-map-table
11271 +-------------+--------------+-----------------------------------------------------------------------+
11272 |String Key |Value Type |Description |
11273 +=============+==============+=======================================================================+
11274 |- ".ls" |map |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
11275 |- ".hs" | |for the definition of the keys included in that map. |
11281 +-------------+--------------+-----------------------------------------------------------------------+
11285 .. table:: AMDPAL Code Object Hardware Stage Metadata Map
11286 :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table
11288 ========================== ============== ========= ===============================================================
11289 String Key Value Type Required? Description
11290 ========================== ============== ========= ===============================================================
11291 ".entry_point" string The ELF symbol pointing to this pipeline's stage entry point.
11292 ".scratch_memory_size" integer Scratch memory size in bytes.
11293 ".lds_size" integer Local Data Share size in bytes.
11294 ".perf_data_buffer_size" integer Performance data buffer size in bytes.
11295 ".vgpr_count" integer Number of VGPRs used.
11296 ".sgpr_count" integer Number of SGPRs used.
11297 ".vgpr_limit" integer If non-zero, indicates the shader was compiled with a
11298 directive to instruct the compiler to limit the VGPR usage to
11299 be less than or equal to the specified value (only set if
11300 different from HW default).
11301 ".sgpr_limit" integer SGPR count upper limit (only set if different from HW
11303 ".threadgroup_dimensions" sequence of Thread-group X/Y/Z dimensions (Compute only).
11305 ".wavefront_size" integer Wavefront size (only set if different from HW default).
11306 ".uses_uavs" boolean The shader reads or writes UAVs.
11307 ".uses_rovs" boolean The shader reads or writes ROVs.
11308 ".writes_uavs" boolean The shader writes to one or more UAVs.
11309 ".writes_depth" boolean The shader writes out a depth value.
11310 ".uses_append_consume" boolean The shader uses append and/or consume operations, either
11312 ".uses_prim_id" boolean The shader uses PrimID.
11313 ========================== ============== ========= ===============================================================
11317 .. table:: AMDPAL Code Object Shader Function Map
11318 :name: amdgpu-amdpal-code-object-shader-function-map-table
11320 =============== ============== ====================================================================
11321 String Key Value Type Description
11322 =============== ============== ====================================================================
11323 *symbol name* map *symbol name* is the ELF symbol name of the shader function code
11324 entry address. The value is the function's metadata. See
11325 :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
11326 =============== ============== ====================================================================
11330 .. table:: AMDPAL Code Object Shader Function Metadata Map
11331 :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table
11333 ============================= ============== =================================================================
11334 String Key Value Type Description
11335 ============================= ============== =================================================================
11336 ".api_shader_hash" sequence of Input shader hash, typically passed in from the client. The value
11337 2 integers is implementation defined, and cannot be relied on between
11338 different builds of the compiler.
11339 ".scratch_memory_size" integer Size in bytes of scratch memory used by the shader.
11340 ".lds_size" integer Size in bytes of LDS memory.
11341 ".vgpr_count" integer Number of VGPRs used by the shader.
11342 ".sgpr_count" integer Number of SGPRs used by the shader.
11343 ".stack_frame_size_in_bytes" integer Amount of stack size used by the shader.
11344 ".shader_subtype" string Shader subtype/kind. Values include:
11348 ============================= ============== =================================================================
11352 .. table:: AMDPAL Code Object Register Map
11353 :name: amdgpu-amdpal-code-object-register-map-table
11355 ========================== ============== ====================================================================
11356 32-bit Integer Key Value Type Description
11357 ========================== ============== ====================================================================
11358 ``reg offset`` 32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
11359 a GRBM register (i.e., driver accessible GPU register number, not
11360 shader GPR register number). The driver is required to program each
11361 specified register to the corresponding specified value when
11362 executing this pipeline. Typically, the ``reg offsets`` are the
11363 ``uint16_t`` offsets to each register as defined by the hardware
11364 chip headers. The register is set to the provided value. However, a
11365 ``reg offset`` that specifies a user data register (e.g.,
11366 COMPUTE_USER_DATA_0) needs special treatment. See
11367 :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
11369 ========================== ============== ====================================================================
11371 .. _amdgpu-amdpal-code-object-user-data-section:
11376 Each hardware stage has a set of 32-bit physical SPI *user data registers*
11377 (either 16 or 32 based on graphics IP and the stage) which can be
11378 written from a command buffer and then loaded into SGPRs when waves are
11379 launched via a subsequent dispatch or draw operation. This is the way
11380 most arguments are passed from the application/runtime to a hardware
11383 PAL abstracts this functionality by exposing a set of 128 *user data
11384 entries* per pipeline a client can use to pass arguments from a command
11385 buffer to one or more shaders in that pipeline. The ELF code object must
11386 specify a mapping from virtualized *user data entries* to physical *user
11387 data registers*, and PAL is responsible for implementing that mapping,
11388 including spilling overflow *user data entries* to memory if needed.
11390 Since the *user data registers* are GRBM-accessible SPI registers, this
11391 mapping is actually embedded in the ``.registers`` metadata entry. For
11392 most registers, the value in that map is a literal 32-bit value that
11393 should be written to the register by the driver. However, when the
11394 register is a *user data register* (any USER_DATA register e.g.,
11395 SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
11396 the driver to write either a *user data entry* value or one of several
11397 driver-internal values to the register. This encoding is described in
11398 the following table:
11402 Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0,
11403 and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
11404 always be programmed to the address of the GlobalTable, and *user data
11405 register* 1 must always be programmed to the address of the PerShaderTable.
11409 .. table:: AMDPAL User Data Mapping
11410 :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table
11412 ========== ================= ===============================================================================
11413 Value Name Description
11414 ========== ================= ===============================================================================
11415 0..127 *User Data Entry* 32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
11416 0x10000000 GlobalTable 32-bit pointer to GPU memory containing the global internal table (should
11417 always point to *user data register* 0).
11418 0x10000001 PerShaderTable 32-bit pointer to GPU memory containing the per-shader internal table. See
11419 :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
11420 for more detail (should always point to *user data register* 1).
11421 0x10000002 SpillTable 32-bit pointer to GPU memory containing the user data spill table. See
11422 :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
11424 0x10000003 BaseVertex Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
11425 reference the draw index in the vertex shader. Only supported by the first
11426 stage in a graphics pipeline.
11427 0x10000004 BaseInstance Instance offset (32-bit unsigned integer). Only supported by the first stage in
11428 a graphics pipeline.
11429 0x10000005 DrawIndex Draw index (32-bit unsigned integer). Only supported by the first stage in a
11431 0x10000006 Workgroup Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
11432 a buffer containing the grid dimensions for a Compute dispatch operation. The
11433 high half of the address is stored in the next sequential user-SGPR. Only
11434 supported by compute pipelines.
11435 0x1000000A EsGsLdsSize Indicates that PAL will program this user-SGPR to contain the amount of LDS
11436 space used for the ES/GS pseudo-ring-buffer for passing data between shader
11438 0x1000000B ViewId View id (32-bit unsigned integer) identifies a view of graphic
11439 pipeline instancing.
11440 0x1000000C StreamOutTable 32-bit pointer to GPU memory containing the stream out target SRD table. This
11441 can only appear for one shader stage per pipeline.
11442 0x1000000D PerShaderPerfData 32-bit pointer to GPU memory containing the per-shader performance data buffer.
11443 0x1000000F VertexBufferTable 32-bit pointer to GPU memory containing the vertex buffer SRD table. This can
11444 only appear for one shader stage per pipeline.
11445 0x10000010 UavExportTable 32-bit pointer to GPU memory containing the UAV export SRD table. This can
11446 only appear for one shader stage per pipeline (PS). These replace color targets
11447 and are completely separate from any UAVs used by the shader. This is optional,
11448 and only used by the PS when UAV exports are used to replace color-target
11449 exports to optimize specific shaders.
11450 0x10000011 NggCullingData 64-bit pointer to GPU memory containing the hardware register data needed by
11451 some NGG pipelines to perform culling. This value contains the address of the
11452 first of two consecutive registers which provide the full GPU address.
11453 0x10000015 FetchShaderPtr 64-bit pointer to GPU memory containing the fetch shader subroutine.
11454 ========== ================= ===============================================================================
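
The encoding in the table above can be decoded with a simple lookup. The
following is a minimal sketch: the function name ``decode_user_data_mapping``
and the returned tuple layout are hypothetical, and only the values listed in
the table are recognized.

.. code-block:: python

   USER_DATA_ENTRY_COUNT = 128  # user data entries 0..127

   # Driver-internal values, copied from the AMDPAL User Data Mapping table.
   SPECIAL_MAPPINGS = {
       0x10000000: "GlobalTable",
       0x10000001: "PerShaderTable",
       0x10000002: "SpillTable",
       0x10000003: "BaseVertex",
       0x10000004: "BaseInstance",
       0x10000005: "DrawIndex",
       0x10000006: "Workgroup",
       0x1000000A: "EsGsLdsSize",
       0x1000000B: "ViewId",
       0x1000000C: "StreamOutTable",
       0x1000000D: "PerShaderPerfData",
       0x1000000F: "VertexBufferTable",
       0x10000010: "UavExportTable",
       0x10000011: "NggCullingData",
       0x10000015: "FetchShaderPtr",
   }


   def decode_user_data_mapping(value):
       """Classify the value stored for a user data register.

       Returns ("UserDataEntry", N) for a user data entry index, or
       (name, None) for a driver-internal value.
       """
       if 0 <= value < USER_DATA_ENTRY_COUNT:
           return ("UserDataEntry", value)
       name = SPECIAL_MAPPINGS.get(value)
       if name is None:
           raise ValueError(f"unrecognized user data mapping: {value:#x}")
       return (name, None)

For example, a ``.registers`` value of ``5`` for a user data register would
select user data entry 5, while ``0x10000002`` would select the spill table
pointer.
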
11456 .. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:
11461 Low 32 bits of the GPU address for an optional buffer in the ``.data``
11462 section of the ELF. The high 32 bits of the address match the high 32 bits
11463 of the shader's program counter.
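
The full 64-bit address can be formed by combining the two halves. This is a
minimal sketch of that arithmetic. The function name and the example values
are hypothetical.

.. code-block:: python

   def per_shader_table_address(user_data_low32, shader_pc):
       """Combine the 32-bit user data value with the high half of the PC."""
       high32 = shader_pc & 0xFFFFFFFF00000000  # high 32 bits of the PC
       return high32 | (user_data_low32 & 0xFFFFFFFF)


   # Hypothetical values, chosen only to show the bit layout.
   address = per_shader_table_address(0x00002000, 0x00007F1234567890)
   print(hex(address))
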
11465 The buffer can be anything the shader compiler needs it for, and
11466 allows each shader to have its own region of the ``.data`` section.
11467 Typically, this could be a table of buffer SRDs and the data pointed to
11468 by the buffer SRDs, but it could be a flat-address region of memory as
11469 well. Its layout and usage are defined by the shader compiler.
11471 Each shader's table in the ``.data`` section is referenced by the symbol
11472 ``_amdgpu_``\ *xs*\ ``_shdr_intrl_data`` where *xs* corresponds with the
11473 hardware shader stage the data is for. E.g.,
11474 ``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.
11476 .. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:
11481 It is possible for a hardware shader to need access to more *user data
11482 entries* than there are slots available in user data registers for one
11483 or more hardware shader stages. In that case, the PAL runtime expects
11484 the necessary *user data entries* to be spilled to GPU memory and uses
11485 one user data register to point to the spilled user data memory. The
11486 value of the *user data entry* must then represent the location where
11487 a shader expects to read the low 32-bits of the table's GPU virtual
11488 address. The *spill table* itself represents a set of 32-bit values
11489 managed by the PAL runtime in GPU-accessible memory that can be made
11490 indirectly accessible to a hardware shader.
11495 This section provides code conventions used when the target triple OS is
11496 empty (see :ref:`amdgpu-target-triples`).
11501 For code objects generated by the AMDGPU backend for a non-amdhsa OS, the runtime does
11502 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
11503 instructions are handled as follows:
11505 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
11506 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
11508 =============== =============== ===========================================
11509 Usage Code Sequence Description
11510 =============== =============== ===========================================
11511 llvm.trap s_endpgm Causes wavefront to be terminated.
11512 llvm.debugtrap *none* Compiler warning given that there is no
11513 trap handler installed.
11514 =============== =============== ===========================================
11524 When the language is OpenCL the following differences occur:
11526 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
11527 2. The AMDGPU backend appends additional arguments to the kernel's explicit
11528 arguments for the AMDHSA OS (see
11529 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
11530 3. Additional metadata is generated
11531 (see :ref:`amdgpu-amdhsa-code-object-metadata`).
11533 .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
11534 :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
11536 ======== ==== ========= ===========================================
11537 Position Byte Byte Description
11539 ======== ==== ========= ===========================================
11540 1 8 8 OpenCL Global Offset X
11541 2 8 8 OpenCL Global Offset Y
11542 3 8 8 OpenCL Global Offset Z
11543 4 8 8 OpenCL address of printf buffer
11544 5 8 8 OpenCL address of virtual queue used by
11546 6 8 8 OpenCL address of AqlWrap struct used by
11548 7 8 8 Pointer argument used for Multi-grid
11550 ======== ==== ========= ===========================================
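
The byte offsets of the appended arguments follow from the sizes and
alignments in the table. The sketch below assumes the implicit arguments start
at the first 8-byte aligned offset after the explicit arguments and are packed
in table order. The argument labels and the function name are hypothetical and
exist only for this illustration.

.. code-block:: python

   # Hypothetical labels for the seven implicit arguments, in table order.
   # Every entry has a byte size of 8 and a byte alignment of 8.
   IMPLICIT_ARGS = [
       "global_offset_x",
       "global_offset_y",
       "global_offset_z",
       "printf_buffer",
       "virtual_queue",
       "aqlwrap",
       "multigrid_pointer",
   ]


   def align_up(value, alignment):
       """Round value up to the next multiple of alignment."""
       return (value + alignment - 1) // alignment * alignment


   def implicit_arg_offsets(explicit_args_size):
       """Map each implicit argument label to its assumed byte offset."""
       offset = align_up(explicit_args_size, 8)
       offsets = {}
       for name in IMPLICIT_ARGS:
           offsets[name] = offset
           offset += 8
       return offsets

With 12 bytes of explicit arguments, this sketch would place the first
implicit argument at byte offset 16.
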
11557 When the language is HCC the following differences occur:
11559 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
11561 .. _amdgpu-assembler:
11566 The AMDGPU backend has an LLVM-MC based assembler, which is currently in development.
11567 It supports AMDGCN GFX6-GFX10.
11569 This section describes general syntax for instructions and operands.
11574 An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
11576 | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
11577 <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
11579 :doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
11580 :doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
11582 The order of operands and modifiers is fixed.
11583 Most modifiers are optional and may be omitted.
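
The separation rules can be illustrated with a small splitter. This is a rough
sketch and not how the assembler itself parses input: it treats commas and
whitespace inside ``[]`` or ``()`` as part of a single token, and it does not
handle comments or expressions such as the ``&``-joined ``s_waitcnt``
operands. The function names are hypothetical.

.. code-block:: python

   def split_top_level(text, separators):
       """Split text on separator characters that lie outside [] and ()."""
       parts, current, depth = [], [], 0
       for ch in text:
           if ch in "[(":
               depth += 1
           elif ch in "])":
               depth -= 1
           if depth == 0 and ch in separators:
               parts.append("".join(current))
               current = []
           else:
               current.append(ch)
       parts.append("".join(current))
       return [p.strip() for p in parts if p.strip()]


   def parse_instruction(line):
       """Return (opcode, operands, modifiers) for a simple instruction line."""
       fields = line.split(None, 1)
       opcode = fields[0]
       rest = fields[1] if len(fields) > 1 else ""
       chunks = split_top_level(rest, ",")
       if not chunks:
           return opcode, [], []
       # The text after the last comma holds the final operand followed by
       # the space-separated modifiers.
       *operands, last = chunks
       last_operand, *modifiers = split_top_level(last, " \t")
       return opcode, operands + [last_operand], modifiers


   print(parse_instruction("ds_add_u32 v2, v4 offset:16"))
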
11585 Links to detailed instruction syntax descriptions may be found in the following
11586 table. Note that features under development are not included
11587 in this description.
11589 =================================== =======================================
11590 Core ISA ISA Extensions
11591 =================================== =======================================
11592 :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>` \-
11593 :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>` \-
11594 :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>` :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`
11596 :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`
11598 :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`
11600 :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`
11602 :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`
11604 :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`
11606 :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`
11608 :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`
11610 :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
11611 =================================== =======================================
11613 For more information about instructions, their semantics and supported
11614 combinations of operands, refer to one of the instruction set architecture manuals
11615 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
11616 [AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
11617 [AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX10-RDNA1]_ and [AMD-GCN-GFX10-RDNA2]_.
11622 A detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
11627 A detailed description of modifiers may be found
11628 :doc:`here<AMDGPUModifierSyntax>`.
11630 Instruction Examples
11631 ~~~~~~~~~~~~~~~~~~~~
11636 .. code-block:: nasm
11638 ds_add_u32 v2, v4 offset:16
11639 ds_write_src2_b64 v2 offset0:4 offset1:8
11640 ds_cmpst_f32 v2, v4, v6
11641 ds_min_rtn_f64 v[8:9], v2, v[4:5]
11643 For the full list of supported instructions, refer to "LDS/GDS instructions" in ISA
11649 .. code-block:: nasm
11651 flat_load_dword v1, v[3:4]
11652 flat_store_dwordx3 v[3:4], v[5:7]
11653 flat_atomic_swap v1, v[3:4], v5 glc
11654 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
11655 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
11657 For the full list of supported instructions, refer to "FLAT instructions" in ISA
11663 .. code-block:: nasm
11665 buffer_load_dword v1, off, s[4:7], s1
11666 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
11667 buffer_store_format_xy v[1:2], off, s[4:7], s1
11669 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
11671 For the full list of supported instructions, refer to "MUBUF Instructions" in ISA
11677 .. code-block:: nasm
11679 s_load_dword s1, s[2:3], 0xfc
11680 s_load_dwordx8 s[8:15], s[2:3], s4
11681 s_load_dwordx16 s[88:103], s[2:3], s4
11685 For the full list of supported instructions, refer to "Scalar Memory Operations" in
11691 .. code-block:: nasm
11694 s_mov_b64 s[0:1], 0x80000000
11696 s_wqm_b64 s[2:3], s[4:5]
11697 s_bcnt0_i32_b64 s1, s[2:3]
11698 s_swappc_b64 s[2:3], s[4:5]
11699 s_cbranch_join s[4:5]
11701 For the full list of supported instructions, refer to "SOP1 Instructions" in ISA
11707 .. code-block:: nasm
11709 s_add_u32 s1, s2, s3
11710 s_and_b64 s[2:3], s[4:5], s[6:7]
11711 s_cselect_b32 s1, s2, s3
11712 s_andn2_b32 s2, s4, s6
11713 s_lshr_b64 s[2:3], s[4:5], s6
11714 s_ashr_i32 s2, s4, s6
11715 s_bfm_b64 s[2:3], s4, s6
11716 s_bfe_i64 s[2:3], s[4:5], s6
11717 s_cbranch_g_fork s[4:5], s[6:7]
11719 For the full list of supported instructions, refer to "SOP2 Instructions" in ISA
11725 .. code-block:: nasm
11727 s_cmp_eq_i32 s1, s2
11728 s_bitcmp1_b32 s1, s2
11729 s_bitcmp0_b64 s[2:3], s4
11732 For the full list of supported instructions, refer to "SOPC Instructions" in ISA
11738 .. code-block:: nasm
11743 s_waitcnt 0 ; Wait for all counters to be 0
11744 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
11745 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
11749 s_sendmsg sendmsg(MSG_INTERRUPT)
11752 For the full list of supported instructions, refer to "SOPP Instructions" in ISA
11755 Unless otherwise mentioned, little verification is performed on the operands
11756 of SOPP Instructions, so it is up to the programmer to be familiar with the
11757 range of acceptable values.
11762 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
11763 the assembler will automatically use the optimal encoding based on its operands. To
11764 force a specific encoding, one can add a suffix to the opcode of the instruction:
11766 * _e32 for 32-bit VOP1/VOP2/VOPC
11767 * _e64 for 64-bit VOP3
11769 * _sdwa for VOP_SDWA
11771 VOP1/VOP2/VOP3/VOPC examples:
11773 .. code-block:: nasm
11776 v_mov_b32_e32 v1, v2
11778 v_cvt_f64_i32_e32 v[1:2], v2
11779 v_floor_f32_e32 v1, v2
11780 v_bfrev_b32_e32 v1, v2
11781 v_add_f32_e32 v1, v2, v3
11782 v_mul_i32_i24_e64 v1, v2, 3
11783 v_mul_i32_i24_e32 v1, -3, v3
11784 v_mul_i32_i24_e32 v1, -100, v3
11785 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
11786 v_max_f16_e32 v1, v2, v3
11790 .. code-block:: nasm
11792 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
11793 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
11794 v_mov_b32 v0, v0 wave_shl:1
11795 v_mov_b32 v0, v0 row_mirror
11796 v_mov_b32 v0, v0 row_bcast:31
11797 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
11798 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
11799 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
11803 .. code-block:: nasm
11805 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
11806 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
11807 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
11808 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
11809 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
11811 For the full list of supported instructions, refer to "Vector ALU instructions".
11813 .. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
11815 Code Object V2 Predefined Symbols
11816 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11819 Code object V2 is not the default code object version emitted by
11820 this version of LLVM.
11822 The AMDGPU assembler defines and updates some symbols automatically. These
11823 symbols do not affect code generation.
11825 .option.machine_version_major
11826 +++++++++++++++++++++++++++++
11828 Set to the GFX major generation number of the target being assembled for. For
11829 example, when assembling for a "GFX9" target this will be set to the integer
11830 value "9". The possible GFX major generation numbers are presented in
11831 :ref:`amdgpu-processors`.
11833 .option.machine_version_minor
11834 +++++++++++++++++++++++++++++
11836 Set to the GFX minor generation number of the target being assembled for. For
11837 example, when assembling for a "GFX810" target this will be set to the integer
11838 value "1". The possible GFX minor generation numbers are presented in
11839 :ref:`amdgpu-processors`.
11841 .option.machine_version_stepping
11842 ++++++++++++++++++++++++++++++++
11844 Set to the GFX stepping generation number of the target being assembled for.
11845 For example, when assembling for a "GFX704" target this will be set to the
11846 integer value "4". The possible GFX stepping generation numbers are presented
11847 in :ref:`amdgpu-processors`.
11852 Set to zero each time a
11853 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
11854 encountered. At each instruction, if the current value of this symbol is less
11855 than or equal to the maximum VGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that VGPR number plus one.
11862 Set to zero each time a
11863 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
11864 encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
instruction then the symbol value is updated to equal that SGPR number plus one.
11869 .. _amdgpu-amdhsa-assembler-directives-v2:
11871 Code Object V2 Directives
11872 ~~~~~~~~~~~~~~~~~~~~~~~~~
11875 Code object V2 is not the default code object version emitted by
11876 this version of LLVM.
The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.
11881 .hsa_code_object_version major, minor
11882 +++++++++++++++++++++++++++++++++++++
11884 *major* and *minor* are integers that specify the version of the HSA code
11885 object that will be generated by the assembler.
11887 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
11888 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
11891 *major*, *minor*, and *stepping* are all integers that describe the instruction
11892 set architecture (ISA) version of the assembly program.
11894 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
11895 "AMD" and *arch* should always be equal to "AMDGPU".
By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the ``-mcpu`` option that is passed to the assembler.
11900 .. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
11902 .amdgpu_hsa_kernel (name)
11903 +++++++++++++++++++++++++
This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.
11912 This directive marks the beginning of a list of key / value pairs that are used
11913 to specify the amd_kernel_code_t object that will be emitted by the assembler.
11914 The list must be terminated by the *.end_amd_kernel_code_t* directive. For any
11915 amd_kernel_code_t values that are unspecified a default value will be used. The
11916 default value for all keys is 0, with the following exceptions:
- *amd_kernel_code_version_major* defaults to 1.
- *amd_kernel_code_version_minor* defaults to 2.
- *amd_machine_kind* defaults to 1.
- *amd_machine_version_major*, *amd_machine_version_minor*, and
  *amd_machine_version_stepping* are derived from the value of the ``-mcpu``
  option that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10
  onwards, it defaults to 6 if target feature ``wavefrontsize64`` is enabled,
  otherwise 5. Note that wavefront size is specified as a power of two, so a
  value of **n** means a size of 2^ **n**.
- *call_convention* defaults to -1.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4. Note that alignments are specified
  as a power of 2, so a value of **n** means an alignment of 2^ **n**.
- *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
- *enable_mem_ordered* defaults to 1 for GFX10 onwards.
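
As a rough illustration of the power-of-two encoding used by *wavefront_size*
and the alignment keys, the following Python sketch decodes such values and
reproduces the *wavefront_size* default rule from the list above. The helper
names ``default_wavefront_size_log2`` and ``decode_log2`` are hypothetical and
are not part of LLVM.

```python
def default_wavefront_size_log2(gfx_major: int, wavefrontsize64: bool) -> int:
    """Return the default *wavefront_size* key value (a log2 value).

    Hypothetical helper: targets before GFX10 always default to 6; from GFX10
    onwards the default is 6 only if ``wavefrontsize64`` is enabled, else 5.
    """
    if gfx_major < 10:
        return 6
    return 6 if wavefrontsize64 else 5


def decode_log2(value: int) -> int:
    """Convert a power-of-two encoded key value n into the size 2**n."""
    if value < 0:
        raise ValueError("encoded value must be non-negative")
    return 1 << value


# GFX9 always uses 64 lanes; GFX10 without wavefrontsize64 uses 32 lanes.
gfx9_lanes = decode_log2(default_wavefront_size_log2(9, False))
gfx10_lanes = decode_log2(default_wavefront_size_log2(10, False))
# The default alignment value of 4 corresponds to 2**4 = 16.
default_alignment = decode_log2(4)
```

Under these assumptions, an alignment key left at its default of 4 describes a
16-byte alignment, and a *wavefront_size* of 5 describes a 32-lane wavefront.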
11939 The *.amd_kernel_code_t* directive must be placed immediately after the
11940 function label and before any instructions.
For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in ``lib/Target/AMDGPU/AmdKernelCodeT.h``, and
``test/CodeGen/AMDGPU/hsa.s``.
11945 .. _amdgpu-amdhsa-assembler-example-v2:
11947 Code Object V2 Example Source Code
11948 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11951 Code Object V2 is not the default code object version emitted by
11952 this version of LLVM.
11954 Here is an example of a minimal assembly source file, defining one HSA kernel:
11959 .hsa_code_object_version 1,0
11960 .hsa_code_object_isa
11965 .amdgpu_hsa_kernel hello_world
11970 enable_sgpr_kernarg_segment_ptr = 1
11972 compute_pgm_rsrc1_vgprs = 0
11973 compute_pgm_rsrc1_sgprs = 0
11974 compute_pgm_rsrc2_user_sgpr = 2
11975 compute_pgm_rsrc1_wgp_mode = 0
11976 compute_pgm_rsrc1_mem_ordered = 0
11977 compute_pgm_rsrc1_fwd_progress = 1
11978 .end_amd_kernel_code_t
11980 s_load_dwordx2 s[0:1], s[0:1] 0x0
11981 v_mov_b32 v0, 3.14159
11982 s_waitcnt lgkmcnt(0)
11985 flat_store_dword v[1:2], v0
11988 .size hello_world, .Lfunc_end0-hello_world
11990 .. _amdgpu-amdhsa-assembler-predefined-symbols-v3-v4:
11992 Code Object V3 to V4 Predefined Symbols
11993 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11995 The AMDGPU assembler defines and updates some symbols automatically. These
11996 symbols do not affect code generation.
11998 .amdgcn.gfx_generation_number
11999 +++++++++++++++++++++++++++++
12001 Set to the GFX major generation number of the target being assembled for. For
12002 example, when assembling for a "GFX9" target this will be set to the integer
12003 value "9". The possible GFX major generation numbers are presented in
12004 :ref:`amdgpu-processors`.
12006 .amdgcn.gfx_generation_minor
12007 ++++++++++++++++++++++++++++
12009 Set to the GFX minor generation number of the target being assembled for. For
12010 example, when assembling for a "GFX810" target this will be set to the integer
12011 value "1". The possible GFX minor generation numbers are presented in
12012 :ref:`amdgpu-processors`.
12014 .amdgcn.gfx_generation_stepping
12015 +++++++++++++++++++++++++++++++
12017 Set to the GFX stepping generation number of the target being assembled for.
12018 For example, when assembling for a "GFX704" target this will be set to the
12019 integer value "4". The possible GFX stepping generation numbers are presented
12020 in :ref:`amdgpu-processors`.
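
The relationship between a processor name and the three version symbols above
can be sketched in Python. This is only an illustration: the function
``split_gfx_version`` is hypothetical, and the decoding rule (the last two
characters are single hexadecimal digits for minor and stepping, and the
remaining digits are the decimal major number) is an assumption inferred from
the examples in this section, not the assembler's implementation.

```python
from typing import Tuple


def split_gfx_version(processor: str) -> Tuple[int, int, int]:
    """Split a name such as "gfx810" into (major, minor, stepping)."""
    name = processor.lower()
    if not name.startswith("gfx") or len(name) < 6:
        raise ValueError(f"unrecognized processor name: {processor!r}")
    digits = name[3:]
    # Assumption: minor and stepping are one hexadecimal digit each, and
    # everything before them is the decimal major generation number.
    major = int(digits[:-2])
    minor = int(digits[-2], 16)
    stepping = int(digits[-1], 16)
    return major, minor, stepping
```

With this rule, ``split_gfx_version("GFX810")`` yields a minor number of 1 and
``split_gfx_version("GFX704")`` yields a stepping number of 4, matching the
examples given for the symbols above.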
12022 .. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
12024 .amdgcn.next_free_vgpr
12025 ++++++++++++++++++++++
12027 Set to zero before assembly begins. At each instruction, if the current value
12028 of this symbol is less than or equal to the maximum VGPR number explicitly
12029 referenced within that instruction then the symbol value is updated to equal
12030 that VGPR number plus one.
May be used to set the ``.amdhsa_next_free_vgpr`` directive in
12033 :ref:`amdhsa-kernel-directives-table`.
12035 May be set at any time, e.g. manually set to zero at the start of each kernel.
12037 .. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
12039 .amdgcn.next_free_sgpr
12040 ++++++++++++++++++++++
12042 Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
12044 referenced within that instruction then the symbol value is updated to equal
12045 that SGPR number plus one.
May be used to set the ``.amdhsa_next_free_sgpr`` directive in
12048 :ref:`amdhsa-kernel-directives-table`.
12050 May be set at any time, e.g. manually set to zero at the start of each kernel.
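
The update rule shared by ``.amdgcn.next_free_vgpr`` and
``.amdgcn.next_free_sgpr`` can be modelled with a short Python sketch. It is
not the assembler's implementation: the function ``track_next_free`` and its
regular expression are hypothetical, and the expression only recognizes plain
``vN``/``sN`` operands and ``v[A:B]``/``s[A:B]`` ranges.

```python
import re

# Matches v0, s12, v[1:2], s[0:1]. Mnemonics such as v_mov_b32 and special
# registers such as vcc do not match.
_GPR = re.compile(r"\b([vs])(?:(\d+)|\[(\d+):(\d+)\])(?!\w)")


def track_next_free(instructions, next_free=None):
    """Return the tracked counters after scanning the given instructions.

    ``next_free`` holds the starting values; omitting it models the implicit
    "set to zero before assembly begins" (or a manual reset to zero).
    """
    counters = {"v": 0, "s": 0} if next_free is None else dict(next_free)
    for text in instructions:
        for kind, single, low, high in _GPR.findall(text):
            highest = int(single) if single else max(int(low), int(high))
            # Update only if the current value is less than or equal to the
            # highest register number referenced.
            if counters[kind] <= highest:
                counters[kind] = highest + 1
    return counters


usage = track_next_free([
    "s_load_dwordx2 s[0:1], s[0:1] 0x0",
    "v_mov_b32 v0, 3.14159",
    "s_waitcnt lgkmcnt(0)",
    "flat_store_dword v[1:2], v0",
])
```

For these four instructions the sketch tracks ``v[1:2]`` as the highest VGPR
reference and ``s[0:1]`` as the highest SGPR reference. Calling the function
again without ``next_free`` corresponds to manually resetting both symbols to
zero at the start of another kernel.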
12052 .. _amdgpu-amdhsa-assembler-directives-v3-v4:
12054 Code Object V3 to V4 Directives
12055 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
12057 Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
12058 architecture processors, and are not OS-specific. Directives which begin with
12059 ``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
12060 ``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
12061 :ref:`amdgpu-processors`.
12063 .. _amdgpu-assembler-directive-amdgcn-target:
12065 .amdgcn_target <target-triple> "-" <target-id>
12066 ++++++++++++++++++++++++++++++++++++++++++++++
12068 Optional directive which declares the ``<target-triple>-<target-id>`` supported
12069 by the containing assembler source file. Used by the assembler to validate
12070 command-line options such as ``-triple``, ``-mcpu``, and
12071 ``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
12072 :ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.
12076 The target ID syntax used for code object V2 to V3 for this directive differs
12077 from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
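
To make the shape of the ``.amdgcn_target`` operand concrete, the following
Python sketch splits a string of the form ``<target-triple>-<target-id>`` into
its parts. It assumes the triple always has the four
``<Architecture>-<Vendor>-<OS>-<Environment>`` fields (the environment may be
empty) and treats the target ID as an opaque string. The function
``split_amdgcn_target`` is hypothetical and performs none of the validation
done by the assembler.

```python
def split_amdgcn_target(operand: str) -> dict:
    """Split "<arch>-<vendor>-<os>-<environment>-<target-id>" into fields."""
    # Split at most four times so that any "-" inside the target ID (for
    # example a disabled feature) stays part of the target ID.
    parts = operand.split("-", 4)
    if len(parts) != 5:
        raise ValueError(f"expected <target-triple>-<target-id>: {operand!r}")
    arch, vendor, os_name, environment, target_id = parts
    return {
        "arch": arch,
        "vendor": vendor,
        "os": os_name,
        "environment": environment,
        "target_id": target_id,
    }


fields = split_amdgcn_target("amdgcn-amd-amdhsa--gfx900+xnack")
```

In this example the environment field is empty, which is why two consecutive
hyphens appear before the target ID.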
12079 .amdhsa_kernel <name>
12080 +++++++++++++++++++++
12082 Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
12083 ``<name>.kd``, in the current location of the current section. Only valid when
12084 the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
12085 instruction to execute, and does not need to be previously defined.
12087 Marks the beginning of a list of directives used to generate the bytes of a
12088 kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
12089 Directives which may appear in this list are described in
12090 :ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
12091 be valid for the target being assembled for, and cannot be repeated. Directives
12092 support the range of values specified by the field they reference in
12093 :ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
12094 assumed to have its default value, unless it is marked as "Required", in which
12095 case it is an error to omit the directive. This list of directives is
12096 terminated by an ``.end_amdhsa_kernel`` directive.
12098 .. table:: AMDHSA Kernel Assembler Directives
12099 :name: amdhsa-kernel-directives-table
12101 ======================================================== =================== ============ ===================
12102 Directive Default Supported On Description
12103 ======================================================== =================== ============ ===================
12104 ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX10 Controls GROUP_SEGMENT_FIXED_SIZE in
12105 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12106 ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX10 Controls PRIVATE_SEGMENT_FIXED_SIZE in
12107 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12108 ``.amdhsa_kernarg_size`` 0 GFX6-GFX10 Controls KERNARG_SIZE in
12109 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12110 ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
12111 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12112 ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_PTR in
12113 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12114 ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_QUEUE_PTR in
12115 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12116 ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
12117 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12118 ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_ID in
12119 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12120 ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX10 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
12121 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12122 ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
12123 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12124 ``.amdhsa_wavefront_size32`` Target GFX10 Controls ENABLE_WAVEFRONT_SIZE32 in
12125 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
12128 ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX10 Controls ENABLE_PRIVATE_SEGMENT in
12129 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12130 ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_X in
12131 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12132 ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Y in
12133 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12134 ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Z in
12135 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12136 ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_INFO in
12137 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12138 ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX10 Controls ENABLE_VGPR_WORKITEM_ID in
12139 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12140 Possible values are defined in
12141 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
12142 ``.amdhsa_next_free_vgpr`` Required GFX6-GFX10 Maximum VGPR number explicitly referenced, plus one.
12143 Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
12144 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12145 ``.amdhsa_next_free_sgpr`` Required GFX6-GFX10 Maximum SGPR number explicitly referenced, plus one.
12146 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
12147 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
``.amdhsa_accum_offset`` Required GFX90A Offset of the first AccVGPR in the unified register file.
12149 Used to calculate ACCUM_OFFSET in
12150 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
12151 ``.amdhsa_reserve_vcc`` 1 GFX6-GFX10 Whether the kernel may use the special VCC SGPR.
12152 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
12153 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12154 ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX10 Whether the kernel may use flat instructions to access
12155 scratch memory. Used to calculate
12156 GRANULATED_WAVEFRONT_SGPR_COUNT in
12157 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12158 ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay.
12159 Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
12160 Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12162 ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_32 in
12163 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12164 Possible values are defined in
12165 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
12166 ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_16_64 in
12167 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12168 Possible values are defined in
12169 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
12170 ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX10 Controls FLOAT_DENORM_MODE_32 in
12171 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12172 Possible values are defined in
12173 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
12174 ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX10 Controls FLOAT_DENORM_MODE_16_64 in
12175 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12176 Possible values are defined in
12177 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
12178 ``.amdhsa_dx10_clamp`` 1 GFX6-GFX10 Controls ENABLE_DX10_CLAMP in
12179 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12180 ``.amdhsa_ieee_mode`` 1 GFX6-GFX10 Controls ENABLE_IEEE_MODE in
12181 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12182 ``.amdhsa_fp16_overflow`` 0 GFX9-GFX10 Controls FP16_OVFL in
12183 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12184 ``.amdhsa_tg_split`` Target GFX90A Controls TG_SPLIT in
12185 Feature :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
12188 ``.amdhsa_workgroup_processor_mode`` Target GFX10 Controls ENABLE_WGP_MODE in
Feature :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12192 ``.amdhsa_memory_ordered`` 1 GFX10 Controls MEM_ORDERED in
12193 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12194 ``.amdhsa_forward_progress`` 0 GFX10 Controls FWD_PROGRESS in
12195 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
12196 ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
12197 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12198 ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
12199 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12200 ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
12201 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12202 ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
12203 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12204 ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
12205 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12206 ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
12207 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12208 ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
12209 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
12210 ======================================================== =================== ============ ===================
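
The rules for the directive list between ``.amdhsa_kernel`` and
``.end_amdhsa_kernel`` (any order, no repeats, defaults for omitted directives,
and an error for omitted "Required" directives) can be sketched as follows.
This Python model is illustrative only: ``resolve_kernel_directives`` is a
hypothetical function, and only a handful of rows from the table above are
included.

```python
# A small subset of the directives and defaults listed in the table above.
DEFAULTS = {
    ".amdhsa_group_segment_fixed_size": 0,
    ".amdhsa_user_sgpr_kernarg_segment_ptr": 0,
    ".amdhsa_system_sgpr_workgroup_id_x": 1,
    ".amdhsa_ieee_mode": 1,
}
# Directives marked "Required" have no default and must be present.
REQUIRED = {".amdhsa_next_free_vgpr", ".amdhsa_next_free_sgpr"}


def resolve_kernel_directives(pairs):
    """Check a list of (directive, value) pairs and fill in the defaults."""
    seen = {}
    for name, value in pairs:
        if name not in DEFAULTS and name not in REQUIRED:
            raise ValueError(f"unknown directive: {name}")
        if name in seen:
            raise ValueError(f"directive cannot be repeated: {name}")
        seen[name] = value
    missing = sorted(REQUIRED - set(seen))
    if missing:
        raise ValueError(f"required directive omitted: {', '.join(missing)}")
    # Explicit values override the defaults; order of appearance is irrelevant.
    return {**DEFAULTS, **seen}


resolved = resolve_kernel_directives([
    (".amdhsa_next_free_sgpr", 2),
    (".amdhsa_user_sgpr_kernarg_segment_ptr", 1),
    (".amdhsa_next_free_vgpr", 3),
])
```

Here ``.amdhsa_system_sgpr_workgroup_id_x`` is not given, so the sketch falls
back to its default of 1, while omitting either of the two required directives
raises an error.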
12215 Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
12216 note record (see :ref:`amdgpu-elf-note-records-table-v3-v4`).
12218 The contents must be in the [YAML]_ markup format, with the same structure and
12219 semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3` or
12220 :ref:`amdgpu-amdhsa-code-object-metadata-v4`.
12222 This directive is terminated by an ``.end_amdgpu_metadata`` directive.
12224 .. _amdgpu-amdhsa-assembler-example-v3-v4:
12226 Code Object V3 to V4 Example Source Code
12227 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
12229 Here is an example of a minimal assembly source file, defining one HSA kernel:
12234 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
12239 .type hello_world,@function
12241 s_load_dwordx2 s[0:1], s[0:1] 0x0
12242 v_mov_b32 v0, 3.14159
12243 s_waitcnt lgkmcnt(0)
12246 flat_store_dword v[1:2], v0
12249 .size hello_world, .Lfunc_end0-hello_world
12253 .amdhsa_kernel hello_world
12254 .amdhsa_user_sgpr_kernarg_segment_ptr 1
12255 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
12256 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
12265 - .name: hello_world
12266 .symbol: hello_world.kd
12267 .kernarg_segment_size: 48
12268 .group_segment_fixed_size: 0
12269 .private_segment_fixed_size: 0
12270 .kernarg_segment_align: 4
12271 .wavefront_size: 64
12274 .max_flat_workgroup_size: 256
12278 .value_kind: global_buffer
12279 .address_space: global
12280 .actual_access: write_only
12282 .end_amdgpu_metadata
12284 This kernel is equivalent to the following HIP program:
12289 __global__ void hello_world(float *p) {
If an assembly source file contains multiple kernels and/or functions, the
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
the ``.set <symbol>, <expression>`` directive. For example, in the case of two
kernels, where ``func1`` is only called from ``kern1``, it is sufficient to
group the function with the kernel that calls it and reset the symbols between
the two connected components:
12304 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
12306 // gpr tracking symbols are implicitly set to zero
12311 .type kern0,@function
12316 .size kern0, .Lkern0_end-kern0
12320 .amdhsa_kernel kern0
12322 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
12323 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
12326 // reset symbols to begin tracking usage in func1 and kern1
12327 .set .amdgcn.next_free_vgpr, 0
12328 .set .amdgcn.next_free_sgpr, 0
12334 .type func1,@function
12337 s_setpc_b64 s[30:31]
12339 .size func1, .Lfunc1_end-func1
12343 .type kern1,@function
12347 s_add_u32 s4, s4, func1@rel32@lo+4
s_addc_u32 s5, s5, func1@rel32@hi+12
12349 s_swappc_b64 s[30:31], s[4:5]
12353 .size kern1, .Lkern1_end-kern1
12357 .amdhsa_kernel kern1
12359 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
12360 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
12363 These symbols cannot identify connected components in order to automatically
12364 track the usage for each kernel. However, in some cases careful organization of
12365 the kernels and functions in the source file means there is minimal additional
12366 effort required to accurately calculate GPR usage.
12368 Additional Documentation
12369 ========================
12371 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
12373 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
12374 .. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
12375 .. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
12376 .. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
12377 .. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
12378 .. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
12379 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
12380 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
12381 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
12382 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
12383 .. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
12384 .. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
12385 .. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
12386 .. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
12387 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
12388 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
12389 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
12390 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
12391 .. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
12392 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
12393 .. [SEMVER] `Semantic Versioning <https://semver.org/>`__
12394 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__