=============================
User Guide for AMDGPU Backend
=============================
14 AMDGPU/AMDGPUAsmGFX900
15 AMDGPU/AMDGPUAsmGFX904
16 AMDGPU/AMDGPUAsmGFX906
17 AMDGPU/AMDGPUAsmGFX908
18 AMDGPU/AMDGPUAsmGFX90a
20 AMDGPU/AMDGPUAsmGFX1011
23 AMDGPUInstructionSyntax
24 AMDGPUInstructionNotation
25 AMDGPUDwarfExtensionsForHeterogeneousDebugging
30 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
31 R600 family up until the current GCN families. It lives in the
32 ``llvm/lib/Target/AMDGPU`` directory.
37 .. _amdgpu-target-triples:
42 Use the Clang option ``-target <Architecture>-<Vendor>-<OS>-<Environment>``
43 to specify the target triple:
.. table:: AMDGPU Architectures
   :name: amdgpu-architecture-table

   ============ ==============================================================
   Architecture Description
   ============ ==============================================================
   ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
   ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
   ============ ==============================================================
55 .. table:: AMDGPU Vendors
56 :name: amdgpu-vendor-table
58 ============ ==============================================================
60 ============ ==============================================================
61 ``amd`` Can be used for all AMD GPU usage.
62 ``mesa3d`` Can be used if the OS is ``mesa3d``.
63 ============ ==============================================================
65 .. table:: AMDGPU Operating Systems
68 ============== ============================================================
70 ============== ============================================================
71 *<empty>* Defaults to the *unknown* OS.
72 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
75 - AMD's ROCm™ runtime [AMD-ROCm]_ using the *rocm-amdhsa*
76 loader on Linux. See *AMD ROCm Platform Release Notes*
77 [AMD-ROCm-Release-Notes]_ for supported hardware and
79 - AMD's PAL runtime using the *pal-amdhsa* loader on
82 ``amdpal`` Graphic shaders and compute kernels executed on AMD's PAL
83 runtime using the *pal-amdpal* loader on Windows and Linux
85 ``mesa3d`` Graphic shaders and compute kernels executed on AMD's Mesa
86 3D runtime using the *mesa-mesa3d* loader on Linux.
87 ============== ============================================================
89 .. table:: AMDGPU Environments
90 :name: amdgpu-environment-table
92 ============ ==============================================================
93 Environment Description
94 ============ ==============================================================
96 ============ ==============================================================
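
As an illustration only, the triple components listed in the tables above can
be split and checked with a few lines of Python. The ``Triple`` type and the
``parse_triple`` helper are hypothetical names introduced for this sketch. It
is not how Clang or LLVM parses triples, and it only knows the architecture
and OS values listed in this guide.

```python
from typing import NamedTuple

# Values copied from the architecture and OS tables of this guide.
ARCHITECTURES = {"r600", "amdgcn"}
OPERATING_SYSTEMS = {"", "amdhsa", "amdpal", "mesa3d"}


class Triple(NamedTuple):
    architecture: str
    vendor: str
    os: str
    environment: str


def parse_triple(text: str) -> Triple:
    """Split <Architecture>-<Vendor>-<OS>-<Environment>.

    Trailing components may be omitted and are returned as empty strings.
    """
    parts = text.split("-")
    if len(parts) > 4:
        raise ValueError(f"too many components in {text!r}")
    parts += [""] * (4 - len(parts))
    triple = Triple(*parts)
    if triple.architecture not in ARCHITECTURES:
        raise ValueError(f"unknown architecture {triple.architecture!r}")
    if triple.os not in OPERATING_SYSTEMS:
        raise ValueError(f"unknown OS {triple.os!r}")
    # This guide states that amdhsa is not supported on the r600 architecture.
    if triple.architecture == "r600" and triple.os == "amdhsa":
        raise ValueError("amdhsa is not supported on r600")
    return triple


print(parse_triple("amdgcn-amd-amdhsa"))
```

An empty OS component is accepted because the OS table defines it as the
*unknown* OS.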
98 .. _amdgpu-processors:
103 Use the Clang options ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>`` to
104 specify the AMDGPU processor together with optional target features. See
105 :ref:`amdgpu-target-id` and :ref:`amdgpu-target-features` for AMD GPU target
106 specific information.
108 Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following exceptions:
* ``amdhsa`` is not supported by the ``r600`` architecture (see :ref:`amdgpu-architecture-table`).
113 .. table:: AMDGPU Processors
114 :name: amdgpu-processor-table
116 =========== =============== ============ ===== ================= =============== =============== ======================
117 Processor Alternative Target dGPU/ Target Target OS Support Example
118 Processor Triple APU Features Properties *(see* Products
119 Architecture Supported `amdgpu-os`_
128 =========== =============== ============ ===== ================= =============== =============== ======================
129 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
130 -----------------------------------------------------------------------------------------------------------------------
131 ``r600`` ``r600`` dGPU - Does not
136 ``r630`` ``r600`` dGPU - Does not
141 ``rs880`` ``r600`` dGPU - Does not
146 ``rv670`` ``r600`` dGPU - Does not
151 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
152 -----------------------------------------------------------------------------------------------------------------------
153 ``rv710`` ``r600`` dGPU - Does not
158 ``rv730`` ``r600`` dGPU - Does not
163 ``rv770`` ``r600`` dGPU - Does not
168 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
169 -----------------------------------------------------------------------------------------------------------------------
170 ``cedar`` ``r600`` dGPU - Does not
175 ``cypress`` ``r600`` dGPU - Does not
180 ``juniper`` ``r600`` dGPU - Does not
185 ``redwood`` ``r600`` dGPU - Does not
190 ``sumo`` ``r600`` dGPU - Does not
195 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
196 -----------------------------------------------------------------------------------------------------------------------
197 ``barts`` ``r600`` dGPU - Does not
202 ``caicos`` ``r600`` dGPU - Does not
207 ``cayman`` ``r600`` dGPU - Does not
212 ``turks`` ``r600`` dGPU - Does not
217 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
218 -----------------------------------------------------------------------------------------------------------------------
219 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
224 ``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
229 ``gfx602`` - ``hainan`` ``amdgcn`` dGPU - Does not - *pal-amdpal*
234 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
235 -----------------------------------------------------------------------------------------------------------------------
236 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - Offset - *rocm-amdhsa* - A6-7000
237 flat - *pal-amdhsa* - A6 Pro-7050B
238 scratch - *pal-amdpal* - A8-7100
246 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - FirePro W8100
247 flat - *pal-amdhsa* - FirePro W9100
248 scratch - *pal-amdpal* - FirePro S9150
250 ``gfx702`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon R9 290
251 flat - *pal-amdhsa* - Radeon R9 290x
252 scratch - *pal-amdpal* - Radeon R390
254 ``gfx703`` - ``kabini`` ``amdgcn`` APU - Offset - *pal-amdhsa* - E1-2100
255 - ``mullins`` flat - *pal-amdpal* - E1-2200
263 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Offset - *pal-amdhsa* - Radeon HD 7790
264 flat - *pal-amdpal* - Radeon HD 8770
267 ``gfx705`` ``amdgcn`` APU - Offset - *pal-amdhsa* *TBA*
274 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
275 -----------------------------------------------------------------------------------------------------------------------
276 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - Offset - *rocm-amdhsa* - A6-8500P
277 flat - *pal-amdhsa* - Pro A6-8500B
278 scratch - *pal-amdpal* - A8-8600P
294 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon R9 285
295 - ``tonga`` flat - *pal-amdhsa* - Radeon R9 380
296 scratch - *pal-amdpal* - Radeon R9 385
297 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - *rocm-amdhsa* - Radeon R9 Nano
298 - *pal-amdhsa* - Radeon R9 Fury
299 - *pal-amdpal* - Radeon R9 FuryX
302 - Radeon Instinct MI8
303 \ - ``polaris10`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon RX 470
304 flat - *pal-amdhsa* - Radeon RX 480
305 scratch - *pal-amdpal* - Radeon Instinct MI6
306 \ - ``polaris11`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - Radeon RX 460
308 scratch - *pal-amdpal*
309 ``gfx805`` - ``tongapro`` ``amdgcn`` dGPU - Offset - *rocm-amdhsa* - FirePro S7150
310 flat - *pal-amdhsa* - FirePro S7100
311 scratch - *pal-amdpal* - FirePro W7100
314 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack - Offset - *rocm-amdhsa* *TBA*
316 scratch - *pal-amdpal* .. TODO::
321 **GCN GFX9 (Vega)** [AMD-GCN-GFX900-GFX904-VEGA]_ [AMD-GCN-GFX906-VEGA7NM]_ [AMD-GCN-GFX908-CDNA1]_
322 -----------------------------------------------------------------------------------------------------------------------
323 ``gfx900`` ``amdgcn`` dGPU - xnack - Absolute - *rocm-amdhsa* - Radeon Vega
324 flat - *pal-amdhsa* Frontier Edition
325 scratch - *pal-amdpal* - Radeon RX Vega 56
329 - Radeon Instinct MI25
330 ``gfx902`` ``amdgcn`` APU - xnack - Absolute - *rocm-amdhsa* - Ryzen 3 2200G
331 flat - *pal-amdhsa* - Ryzen 5 2400G
332 scratch - *pal-amdpal*
333 ``gfx904`` ``amdgcn`` dGPU - xnack - *rocm-amdhsa* *TBA*
335 - *pal-amdpal* .. TODO::
340 ``gfx906`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* - Radeon Instinct MI50
341 - xnack flat - *pal-amdhsa* - Radeon Instinct MI60
342 scratch - *pal-amdpal* - Radeon VII
344 ``gfx908`` ``amdgcn`` dGPU - sramecc - *rocm-amdhsa* - AMD Instinct MI100 Accelerator
348 ``gfx909`` ``amdgcn`` APU - xnack - Absolute - *pal-amdpal* *TBA*
355 ``gfx90a`` ``amdgcn`` dGPU - sramecc - Absolute - *rocm-amdhsa* *TBA*
357 - xnack scratch .. TODO::
359 work-item Add product
362 ``gfx90c`` ``amdgcn`` APU - xnack - Absolute - *pal-amdpal* - Ryzen 7 4700G
363 flat - Ryzen 7 4700GE
364 scratch - Ryzen 5 4600G
376 **GCN GFX10 (RDNA 1)** [AMD-GCN-GFX10-RDNA1]_
377 -----------------------------------------------------------------------------------------------------------------------
378 ``gfx1010`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 5700
379 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 5700 XT
380 - xnack scratch - *pal-amdpal* - Radeon Pro 5600 XT
382 ``gfx1011`` ``amdgcn`` dGPU - cumode - *rocm-amdhsa* - Radeon Pro V520
383 - wavefrontsize64 - Absolute - *pal-amdhsa*
384 - xnack flat - *pal-amdpal*
386 ``gfx1012`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 5500
387 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 5500 XT
388 - xnack scratch - *pal-amdpal*
389 ``gfx1013`` ``amdgcn`` APU - cumode - Absolute - *rocm-amdhsa* *TBA*
390 - wavefrontsize64 flat - *pal-amdhsa*
391 - xnack scratch - *pal-amdpal* .. TODO::
396 **GCN GFX10 (RDNA 2)** [AMD-GCN-GFX10-RDNA2]_
397 -----------------------------------------------------------------------------------------------------------------------
398 ``gfx1030`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 6800
399 - wavefrontsize64 flat - *pal-amdhsa* - Radeon RX 6800 XT
400 scratch - *pal-amdpal* - Radeon RX 6900 XT
401 ``gfx1031`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* - Radeon RX 6700 XT
402 - wavefrontsize64 flat - *pal-amdhsa*
403 scratch - *pal-amdpal*
404 ``gfx1032`` ``amdgcn`` dGPU - cumode - Absolute - *rocm-amdhsa* *TBA*
405 - wavefrontsize64 flat - *pal-amdhsa*
406 scratch - *pal-amdpal* .. TODO::
411 ``gfx1033`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
412 - wavefrontsize64 flat
417 ``gfx1034`` ``amdgcn`` dGPU - cumode - Absolute - *pal-amdpal* *TBA*
418 - wavefrontsize64 flat
424 ``gfx1035`` ``amdgcn`` APU - cumode - Absolute - *pal-amdpal* *TBA*
425 - wavefrontsize64 flat
430 =========== =============== ============ ===== ================= =============== =============== ======================
432 .. _amdgpu-target-features:
437 Target features control how code is generated to support certain
438 processor specific features. Not all target features are supported by
439 all processors. The runtime must ensure that the features supported by
440 the device used to execute the code match the features enabled when
441 generating the code. A mismatch of features may result in incorrect
442 execution, or a reduction in performance.
The target features supported by each processor are listed in
:ref:`amdgpu-processor-table`.
Target features are controlled by exactly one of the following Clang
options:
450 ``-mcpu=<target-id>`` or ``--offload-arch=<target-id>``
The ``-mcpu`` and ``--offload-arch`` options can specify target features as
optional components of the target ID. If omitted, the target feature has the
``any`` value. See :ref:`amdgpu-target-id`.
456 ``-m[no-]<target-feature>``
458 Target features not specified by the target ID are specified using a
459 separate option. These target features can have an ``on`` or ``off``
460 value. ``on`` is specified by omitting the ``no-`` prefix, and
461 ``off`` is specified by including the ``no-`` prefix. The default
462 if not specified is ``off``.
466 ``-mcpu=gfx908:xnack+``
467 Enable the ``xnack`` feature.
468 ``-mcpu=gfx908:xnack-``
469 Disable the ``xnack`` feature.
471 Enable the ``cumode`` feature.
473 Disable the ``cumode`` feature.
475 .. table:: AMDGPU Target Features
476 :name: amdgpu-target-features-table
478 =============== ============================ ==================================================
479 Target Feature Clang Option to Control Description
481 =============== ============================ ==================================================
482 cumode - ``-m[no-]cumode`` Control the wavefront execution mode used
483 when generating code for kernels. When disabled
484 native WGP wavefront execution mode is used,
485 when enabled CU wavefront execution mode is used
486 (see :ref:`amdgpu-amdhsa-memory-model`).
488 sramecc - ``-mcpu`` If specified, generate code that can only be
489 - ``--offload-arch`` loaded and executed in a process that has a
490 matching setting for SRAMECC.
492 If not specified for code object V2 to V3, generate
493 code that can be loaded and executed in a process
494 with SRAMECC enabled.
496 If not specified for code object V4, generate
497 code that can be loaded and executed in a process
498 with either setting of SRAMECC.
500 tgsplit ``-m[no-]tgsplit`` Enable/disable generating code that assumes
501 work-groups are launched in threadgroup split mode.
502 When enabled the waves of a work-group may be
503 launched in different CUs.
505 wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
506 generating code for kernels. When disabled
507 native wavefront size 32 is used, when enabled
508 wavefront size 64 is used.
510 xnack - ``-mcpu`` If specified, generate code that can only be
511 - ``--offload-arch`` loaded and executed in a process that has a
512 matching setting for XNACK replay.
514 If not specified for code object V2 to V3, generate
515 code that can be loaded and executed in a process
516 with XNACK replay enabled.
518 If not specified for code object V4, generate
519 code that can be loaded and executed in a process
520 with either setting of XNACK replay.
522 XNACK replay can be used for demand paging and
523 page migration. If enabled in the device, then if
524 a page fault occurs the code may execute
525 incorrectly unless generated with XNACK replay
526 enabled, or generated for code object V4 without
527 specifying XNACK replay. Executing code that was
528 generated with XNACK replay enabled, or generated
529 for code object V4 without specifying XNACK replay,
530 on a device that does not have XNACK replay
531 enabled will execute correctly but may be less
532 performant than code generated for XNACK replay
534 =============== ============================ ==================================================
536 .. _amdgpu-target-id:
541 AMDGPU supports target IDs. See `Clang Offload Bundler
542 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ for a general
543 description. The AMDGPU target specific information is:
546 Is an AMDGPU processor or alternative processor name specified in
547 :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
the primary processor and alternative processor names. The canonical form
target ID only allows the primary processor name.
552 Is a target feature name specified in :ref:`amdgpu-target-features-table` that
is supported by the processor. The target features supported by each processor
are specified in :ref:`amdgpu-processor-table`. Those that can be specified in
555 a target ID are marked as being controlled by ``-mcpu`` and
556 ``--offload-arch``. Each target feature must appear at most once in a target
557 ID. The non-canonical form target ID allows the target features to be
558 specified in any order. The canonical form target ID requires the target
559 features to be specified in alphabetic order.
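
The canonical form rules above can be sketched in Python. This is a minimal
illustration, not the Clang implementation: ``canonical_target_id`` and the
``ALIASES`` mapping are hypothetical names, the mapping holds only two of the
alternative processor names from :ref:`amdgpu-processor-table`, and the
colon-separated ``<processor>:<feature>+`` spelling is the one used in the
``-mcpu=gfx908:xnack+`` examples of this guide.

```python
# Two alternative-name entries from the processor table; a real tool would
# need the whole table.
ALIASES = {"tahiti": "gfx600", "hawaii": "gfx701"}


def canonical_target_id(target_id: str, aliases=ALIASES) -> str:
    """Rewrite a non-canonical target ID into canonical form."""
    processor, *features = target_id.split(":")
    # The canonical form only allows the primary processor name.
    processor = aliases.get(processor, processor)
    settings = {}
    for feature in features:
        if len(feature) < 2 or feature[-1] not in "+-":
            raise ValueError(f"malformed target feature {feature!r}")
        name, setting = feature[:-1], feature[-1]
        # Each target feature must appear at most once.
        if name in settings:
            raise ValueError(f"target feature {name!r} repeated")
        settings[name] = setting
    # The canonical form lists the features in alphabetic order.
    ordered = [name + settings[name] for name in sorted(settings)]
    return ":".join([processor] + ordered)


print(canonical_target_id("gfx908:xnack+:sramecc-"))
print(canonical_target_id("tahiti"))
```

The sketch does not check that a feature is actually supported by the named
processor; that check would need the *Target Features Supported* column of the
processor table.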
561 .. _amdgpu-target-id-v2-v3:
563 Code Object V2 to V3 Target ID
564 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
566 The target ID syntax for code object V2 to V3 is the same as defined in `Clang
567 Offload Bundler <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_ except
568 when used in the :ref:`amdgpu-assembler-directive-amdgcn-target` assembler
569 directive and the bundle entry ID. In those cases it has the following BNF
574 <target-id> ::== <processor> ( "+" <target-feature> )*
576 Where a target feature is omitted if *Off* and present if *On* or *Any*.
The code object V2 to V3 cannot represent *Any* and treats it the same as
*On*.
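
The effect of this encoding can be sketched as follows. The function name
``v2_v3_target_id`` and the ``"on"``/``"off"``/``"any"`` strings are
hypothetical choices for this illustration, and emitting the features in
alphabetic order is an assumption, since the BNF above does not state an
order.

```python
def v2_v3_target_id(processor: str, features: dict) -> str:
    """Build a code object V2 to V3 style target ID.

    `features` maps a target feature name to "on", "off", or "any".
    """
    parts = [processor]
    for name in sorted(features):  # ordering is an assumption of this sketch
        setting = features[name]
        if setting not in ("on", "off", "any"):
            raise ValueError(f"bad setting {setting!r} for {name!r}")
        # A feature is present if On or Any, and omitted if Off.
        if setting != "off":
            parts.append(name)
    return "+".join(parts)


print(v2_v3_target_id("gfx908", {"xnack": "any", "sramecc": "off"}))
```

Because *Any* and *On* produce the same string, the original setting cannot be
recovered from a V2 to V3 target ID.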
583 .. _amdgpu-embedding-bundled-objects:
585 Embedding Bundled Code Objects
586 ------------------------------
588 AMDGPU supports the HIP and OpenMP languages that perform code object embedding
589 as described in `Clang Offload Bundler
590 <https://clang.llvm.org/docs/ClangOffloadBundler.html>`_.
594 The target ID syntax used for code object V2 to V3 for a bundle entry ID
595 differs from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
597 .. _amdgpu-address-spaces:
602 The AMDGPU architecture supports a number of memory address spaces. The address
603 space names use the OpenCL standard names, with some additions.
605 The AMDGPU address spaces correspond to target architecture specific LLVM
606 address space numbers used in LLVM IR.
608 The AMDGPU address spaces are described in
609 :ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are
610 supported for the ``amdgcn`` target.
612 .. table:: AMDGPU Address Spaces
613 :name: amdgpu-address-spaces-table
615 ================================= =============== =========== ================ ======= ============================
616 .. 64-Bit Process Address Space
617 --------------------------------- --------------- ----------- ---------------- ------------------------------------
618 Address Space Name LLVM IR Address HSA Segment Hardware Address NULL Value
619 Space Number Name Name Size
620 ================================= =============== =========== ================ ======= ============================
621 Generic 0 flat flat 64 0x0000000000000000
622 Global 1 global global 64 0x0000000000000000
623 Region 2 N/A GDS 32 *not implemented for AMDHSA*
624 Local 3 group LDS 32 0xFFFFFFFF
625 Constant 4 constant *same as global* 64 0x0000000000000000
626 Private 5 private scratch 32 0xFFFFFFFF
627 Constant 32-bit 6 *TODO* 0x00000000
628 Buffer Fat Pointer (experimental) 7 *TODO*
629 ================================= =============== =========== ================ ======= ============================
632 The generic address space is supported unless the *Target Properties* column
633 of :ref:`amdgpu-processor-table` specifies *Does not support generic address
636 The generic address space uses the hardware flat address support for two fixed
637 ranges of virtual addresses (the private and local apertures), that are
638 outside the range of addressable global memory, to map from a flat address to
639 a private or local address. This uses FLAT instructions that can take a flat
address and access global, private (scratch), and group (LDS) memory depending
on whether the address is within one of the aperture ranges.
643 Flat access to scratch requires hardware aperture setup and setup in the
644 kernel prologue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat
645 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
646 setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
648 To convert between a private or group address space address (termed a segment
649 address) and a flat address the base address of the corresponding aperture
650 can be used. For GFX7-GFX8 these are available in the
651 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
652 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
653 GFX9-GFX10 the aperture base addresses are directly available as inline
654 constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``.
655 In 64-bit address mode the aperture sizes are 2^32 bytes and the base is
aligned to 2^32 which makes it easier to convert from flat to segment or
segment to flat.
659 A global address space address has the same value when used as a flat address
660 so no conversion is needed.
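
With 2^32-byte apertures aligned to 2^32, the conversions reduce to bit
operations on the low and high 32 bits. The following Python sketch shows the
arithmetic only. The function names and the aperture base value are
hypothetical, and the null handling is an assumption drawn from the *NULL
Value* column of :ref:`amdgpu-address-spaces-table` (``0xFFFFFFFF`` for local
and private, ``0`` for generic); it is not a description of the code the
backend emits.

```python
APERTURE_SIZE = 1 << 32
FLAT_NULL = 0x0000000000000000
SEGMENT_NULL = 0xFFFFFFFF

# Hypothetical aperture base; real values are set up by the hardware/runtime.
EXAMPLE_LOCAL_BASE = 0x0001_0000_0000_0000


def segment_to_flat(segment_address: int, aperture_base: int) -> int:
    """Convert a 32-bit private or local address to a 64-bit flat address."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    if segment_address == SEGMENT_NULL:
        return FLAT_NULL
    # The base has zero low bits, so OR-ing is the same as adding.
    return aperture_base | segment_address


def flat_to_segment(flat_address: int) -> int:
    """Convert a flat address known to lie in an aperture to a segment address."""
    if flat_address == FLAT_NULL:
        return SEGMENT_NULL
    return flat_address & (APERTURE_SIZE - 1)


def in_aperture(flat_address: int, aperture_base: int) -> bool:
    """Check whether a flat address falls inside the given aperture."""
    return (flat_address >> 32) == (aperture_base >> 32)


flat = segment_to_flat(0x40, EXAMPLE_LOCAL_BASE)
print(hex(flat), hex(flat_to_segment(flat)))
```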
662 **Global and Constant**
663 The global and constant address spaces both use global virtual addresses,
664 which are the same virtual address space used by the CPU. However, some
665 virtual addresses may only be accessible to the CPU, some only accessible
666 by the GPU, and some by both.
Using the constant address space indicates that the data will not change
during the execution of the kernel. This allows scalar read instructions to
be used. Because the constant address space can only be modified from the
host side, a generic pointer loaded from the constant address space can
safely be assumed to be a global pointer, since only device global memory is
visible to and managed by the host. Volatile data is invalidated in the vector
and scalar L1 caches before each kernel dispatch executes, which allows
constant memory to change values between kernel dispatches.
678 The region address space uses the hardware Global Data Store (GDS). All
679 wavefronts executing on the same device will access the same memory for any
680 given region address. However, the same region address accessed by wavefronts
681 executing on different devices will access different memory. It is higher
682 performance than global memory. It is allocated by the runtime. The data
683 store (DS) instructions can be used to access it.
686 The local address space uses the hardware Local Data Store (LDS) which is
687 automatically allocated when the hardware creates the wavefronts of a
688 work-group, and freed when all the wavefronts of a work-group have
689 terminated. All wavefronts belonging to the same work-group will access the
690 same memory for any given local address. However, the same local address
691 accessed by wavefronts belonging to different work-groups will access
692 different memory. It is higher performance than global memory. The data store
693 (DS) instructions can be used to access it.
696 The private address space uses the hardware scratch memory support which
697 automatically allocates memory when it creates a wavefront and frees it when
a wavefront terminates. The memory accessed by a lane of a wavefront for any
699 given private address will be different to the memory accessed by another lane
700 of the same or different wavefront for the same private address.
702 If a kernel dispatch uses scratch, then the hardware allocates memory from a
703 pool of backing memory allocated by the runtime for each wavefront. The lanes
704 of the wavefront access this using dword (4 byte) interleaving. The mapping
705 used from private address to backing memory address is:
707 ``wavefront-scratch-base +
708 ((private-address / 4) * wavefront-size * 4) +
709 (wavefront-lane-id * 4) + (private-address % 4)``
711 If each lane of a wavefront accesses the same private address, the
712 interleaving results in adjacent dwords being accessed and hence requires
713 fewer cache lines to be fetched.
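
The mapping can be checked with a short Python sketch. The function name and
the example base address are hypothetical; the arithmetic is the formula given
above.

```python
def scratch_backing_address(wavefront_scratch_base: int,
                            private_address: int,
                            wavefront_lane_id: int,
                            wavefront_size: int = 64) -> int:
    """Map a lane's private address to its backing memory address."""
    dword_index, byte_in_dword = divmod(private_address, 4)
    return (wavefront_scratch_base
            + dword_index * wavefront_size * 4  # one dword row per wavefront
            + wavefront_lane_id * 4             # this lane's dword in the row
            + byte_in_dword)


# Lanes 0..3 accessing the same private address land on adjacent dwords.
base = 0x1000  # hypothetical wavefront scratch base
print([hex(scratch_backing_address(base, 0, lane)) for lane in range(4)])
```

Consecutive dwords of a single lane are ``wavefront_size * 4`` bytes apart,
which is why the same private address across lanes is the access pattern that
stays within the fewest cache lines.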
715 There are different ways that the wavefront scratch base address is
716 determined by a wavefront (see
717 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
719 Scratch memory can be accessed in an interleaved manner using buffer
720 instructions with the scratch buffer descriptor and per wavefront scratch
721 offset, by the scratch instructions, or by flat instructions. Multi-dword
722 access is not supported except by flat and scratch instructions in
728 **Buffer Fat Pointer**
729 The buffer fat pointer is an experimental address space that is currently
730 unsupported in the backend. It exposes a non-integral pointer that is in
731 the future intended to support the modelling of 128-bit buffer descriptors
732 plus a 32-bit offset into the buffer (in total encapsulating a 160-bit
733 *pointer*), allowing normal LLVM load/store/atomic operations to be used to
734 model the buffer descriptors used heavily in graphics workloads targeting
737 .. _amdgpu-memory-scopes:
This section describes the LLVM memory synchronization scopes supported by the AMDGPU
743 backend memory model when the target triple OS is ``amdhsa`` (see
744 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
746 The memory model supported is based on the HSA memory model [HSA]_ which is
747 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
748 relation is transitive over the synchronizes-with relation independent of scope
749 and synchronizes-with allows the memory scope instances to be inclusive (see
750 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
752 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
753 inclusion and requires the memory scopes to exactly match. However, this
754 is conservatively correct for OpenCL.
756 .. table:: AMDHSA LLVM Sync Scopes
757 :name: amdgpu-amdhsa-llvm-sync-scopes-table
759 ======================= ===================================================
760 LLVM Sync Scope Description
761 ======================= ===================================================
762 *none* The default: ``system``.
764 Synchronizes with, and participates in modification
765 and seq_cst total orderings with, other operations
766 (except image operations) for all address spaces
767 (except private, or generic that accesses private)
768 provided the other operation's sync scope is:
771 - ``agent`` and executed by a thread on the same
773 - ``workgroup`` and executed by a thread in the
775 - ``wavefront`` and executed by a thread in the
778 ``agent`` Synchronizes with, and participates in modification
779 and seq_cst total orderings with, other operations
780 (except image operations) for all address spaces
781 (except private, or generic that accesses private)
782 provided the other operation's sync scope is:
784 - ``system`` or ``agent`` and executed by a thread
786 - ``workgroup`` and executed by a thread in the
788 - ``wavefront`` and executed by a thread in the
791 ``workgroup`` Synchronizes with, and participates in modification
792 and seq_cst total orderings with, other operations
793 (except image operations) for all address spaces
794 (except private, or generic that accesses private)
795 provided the other operation's sync scope is:
797 - ``system``, ``agent`` or ``workgroup`` and
798 executed by a thread in the same work-group.
799 - ``wavefront`` and executed by a thread in the
802 ``wavefront`` Synchronizes with, and participates in modification
803 and seq_cst total orderings with, other operations
804 (except image operations) for all address spaces
805 (except private, or generic that accesses private)
806 provided the other operation's sync scope is:
808 - ``system``, ``agent``, ``workgroup`` or
809 ``wavefront`` and executed by a thread in the
812 ``singlethread`` Only synchronizes with and participates in
813 modification and seq_cst total orderings with,
814 other operations (except image operations) running
815 in the same thread for all address spaces (for
816 example, in signal handlers).
818 ``one-as`` Same as ``system`` but only synchronizes with other
819 operations within the same address space.
821 ``agent-one-as`` Same as ``agent`` but only synchronizes with other
822 operations within the same address space.
824 ``workgroup-one-as`` Same as ``workgroup`` but only synchronizes with
825 other operations within the same address space.
827 ``wavefront-one-as`` Same as ``wavefront`` but only synchronizes with
828 other operations within the same address space.
830 ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
831 other operations within the same address space.
832 ======================= ===================================================
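
The scope inclusion rule in the table can be summarized as: two operations
can synchronize when their threads belong to the same instance of the
narrower of the two sync scopes. The Python sketch below encodes that reading
for the ``wavefront``, ``workgroup``, ``agent`` and ``system`` scopes only.
The function and parameter names are hypothetical, and ``singlethread``, the
``one-as`` variants, and the address space and image operation exceptions are
deliberately left out.

```python
# Scopes ordered from narrowest to widest.
SCOPE_ORDER = ["wavefront", "workgroup", "agent", "system"]


def can_synchronize(scope_a: str, scope_b: str, narrowest_shared: str) -> bool:
    """Return True if two operations with the given sync scopes can synchronize.

    `narrowest_shared` names the narrowest execution grouping that contains
    both threads: "wavefront" if they are in the same wavefront, "workgroup"
    if they only share a work-group, "agent" if they only share an agent, and
    "system" if they run on different agents.
    """
    rank = SCOPE_ORDER.index  # raises ValueError for unknown scope names
    narrower_scope = min(rank(scope_a), rank(scope_b))
    return rank(narrowest_shared) <= narrower_scope


# A system-scope operation and an agent-scope operation synchronize only
# when both threads run on the same agent.
print(can_synchronize("system", "agent", "agent"))
print(can_synchronize("system", "agent", "system"))
```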
837 The AMDGPU backend implements the following LLVM IR intrinsics.
839 *This section is WIP.*
843 List AMDGPU intrinsics.
848 The AMDGPU backend supports the following LLVM IR attributes.
.. table:: AMDGPU LLVM IR Attributes
   :name: amdgpu-llvm-ir-attributes-table

   ======================================= ==========================================================
   LLVM Attribute                          Description
   ======================================= ==========================================================
   "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
                                           will be specified when the kernel is dispatched. Generated
                                           by the ``amdgpu_flat_work_group_size`` Clang attribute [CLANG-ATTR]_.
   "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
                                           argument block size for the implicit arguments. This
                                           varies by OS and language (for OpenCL see
                                           :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
   "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
                                           the ``amdgpu_num_sgpr`` Clang attribute [CLANG-ATTR]_.
   "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
                                           ``amdgpu_num_vgpr`` Clang attribute [CLANG-ATTR]_.
   "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
                                           execution unit. Generated by the ``amdgpu_waves_per_eu``
                                           Clang attribute [CLANG-ATTR]_.
   "amdgpu-ieee" true/false.               Specify whether the function expects the IEEE field of the
                                           mode register to be set on entry. Overrides the default for
                                           the calling convention.
   "amdgpu-dx10-clamp" true/false.         Specify whether the function expects the DX10_CLAMP field of
                                           the mode register to be set on entry. Overrides the default
                                           for the calling convention.
   ======================================= ==========================================================
878 .. _amdgpu-elf-code-object:
883 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
884 can be linked by ``lld`` to produce a standard ELF shared code object which can
885 be loaded and executed on an AMDGPU target.
887 .. _amdgpu-elf-header:
892 The AMDGPU backend uses the following ELF header:
894 .. table:: AMDGPU ELF Header
895 :name: amdgpu-elf-header-table
897 ========================== ===============================
899 ========================== ===============================
900 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
901 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
902 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
903 - ``ELFOSABI_AMDGPU_HSA``
904 - ``ELFOSABI_AMDGPU_PAL``
905 - ``ELFOSABI_AMDGPU_MESA3D``
906 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA_V2``
907 - ``ELFABIVERSION_AMDGPU_HSA_V3``
908 - ``ELFABIVERSION_AMDGPU_HSA_V4``
909 - ``ELFABIVERSION_AMDGPU_PAL``
910 - ``ELFABIVERSION_AMDGPU_MESA3D``
911 ``e_type`` - ``ET_REL``
913 ``e_machine`` ``EM_AMDGPU``
915 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-v2-table`,
916 :ref:`amdgpu-elf-header-e_flags-table-v3`,
917 and :ref:`amdgpu-elf-header-e_flags-table-v4`
918 ========================== ===============================
922 .. table:: AMDGPU ELF Header Enumeration Values
923 :name: amdgpu-elf-header-enumeration-values-table
925 =============================== =====
927 =============================== =====
930 ``ELFOSABI_AMDGPU_HSA`` 64
931 ``ELFOSABI_AMDGPU_PAL`` 65
932 ``ELFOSABI_AMDGPU_MESA3D`` 66
933 ``ELFABIVERSION_AMDGPU_HSA_V2`` 0
934 ``ELFABIVERSION_AMDGPU_HSA_V3`` 1
935 ``ELFABIVERSION_AMDGPU_HSA_V4`` 2
936 ``ELFABIVERSION_AMDGPU_PAL`` 0
937 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
938 =============================== =====
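
As a rough illustration of how these fields fit together, the Python sketch
below decodes the identification bytes and the ``e_type`` and ``e_machine``
fields of an AMDGPU code object header. The function name ``describe_header``
is hypothetical. The byte offsets and the values ``ELFCLASS32`` = 1,
``ELFCLASS64`` = 2, ``ELFDATA2LSB`` = 1, ``ET_REL`` = 1, ``ET_DYN`` = 3,
``ELFOSABI_NONE`` = 0 and ``EM_AMDGPU`` = 224 are not shown in the table
above; they are taken from the generic ELF layout and LLVM's ELF definitions
and should be read as assumptions of this sketch.

```python
import struct

# Offsets into e_ident from the generic ELF specification.
EI_CLASS, EI_DATA, EI_OSABI, EI_ABIVERSION = 4, 5, 7, 8
ELFDATA2LSB = 1
EM_AMDGPU = 224  # assumption: value from LLVM's ELF definitions

ELF_CLASSES = {1: "ELFCLASS32", 2: "ELFCLASS64"}
ELF_TYPES = {1: "ET_REL", 3: "ET_DYN"}
OS_ABIS = {
    0: "ELFOSABI_NONE",
    64: "ELFOSABI_AMDGPU_HSA",
    65: "ELFOSABI_AMDGPU_PAL",
    66: "ELFOSABI_AMDGPU_MESA3D",
}
# The ABI version is interpreted relative to the OS ABI.
ABI_VERSIONS = {
    "ELFOSABI_AMDGPU_HSA": {
        0: "ELFABIVERSION_AMDGPU_HSA_V2",
        1: "ELFABIVERSION_AMDGPU_HSA_V3",
        2: "ELFABIVERSION_AMDGPU_HSA_V4",
    },
    "ELFOSABI_AMDGPU_PAL": {0: "ELFABIVERSION_AMDGPU_PAL"},
    "ELFOSABI_AMDGPU_MESA3D": {0: "ELFABIVERSION_AMDGPU_MESA3D"},
}


def describe_header(data: bytes) -> dict:
    """Decode the AMDGPU-relevant fields from the first 20 header bytes."""
    if len(data) < 20 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    if data[EI_DATA] != ELFDATA2LSB:
        raise ValueError("AMDGPU code objects are little-endian")
    # e_type and e_machine are 16-bit fields at offsets 16 and 18.
    e_type, e_machine = struct.unpack_from("<HH", data, 16)
    if e_machine != EM_AMDGPU:
        raise ValueError("e_machine is not EM_AMDGPU")
    os_abi = OS_ABIS.get(data[EI_OSABI], "unknown")
    versions = ABI_VERSIONS.get(os_abi, {})
    return {
        "class": ELF_CLASSES.get(data[EI_CLASS], "unknown"),
        "os_abi": os_abi,
        "abi_version": versions.get(data[EI_ABIVERSION], "unknown"),
        "type": ELF_TYPES.get(e_type, "unknown"),
    }
```

Passing the first 20 bytes of a code object file to ``describe_header`` is
enough, because every field it reads lies before the class-dependent part of
the header.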
940 ``e_ident[EI_CLASS]``
943 * ``ELFCLASS32`` for ``r600`` architecture.
945 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
946 process address space applications.
949 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
951 ``e_ident[EI_OSABI]``
952 One of the following AMDGPU target architecture specific OS ABIs
953 (see :ref:`amdgpu-os`):
955 * ``ELFOSABI_NONE`` for *unknown* OS.
957 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
959 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
961 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
963 ``e_ident[EI_ABIVERSION]``
964 The ABI version of the AMDGPU target architecture specific OS ABI to which the code
967 * ``ELFABIVERSION_AMDGPU_HSA_V2`` is used to specify the version of AMD HSA
968 runtime ABI for code object V2. Specify using the Clang option
969 ``-mcode-object-version=2``.
971 * ``ELFABIVERSION_AMDGPU_HSA_V3`` is used to specify the version of AMD HSA
972 runtime ABI for code object V3. Specify using the Clang option
973 ``-mcode-object-version=3``. This is the default code object
974 version if not specified.
976 * ``ELFABIVERSION_AMDGPU_HSA_V4`` is used to specify the version of AMD HSA
977 runtime ABI for code object V4. Specify using the Clang option
978 ``-mcode-object-version=4``.
980 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
983 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
987 Can be one of the following values:
991 The type produced by the AMDGPU backend compiler as it is relocatable code
995 The type produced by the linker as it is a shared code object.
997 The AMD HSA runtime loader requires an ``ET_DYN`` code object.
1000 The value ``EM_AMDGPU`` is used for the machine for all processors supported
1001 by the ``r600`` and ``amdgcn`` architectures (see
1002 :ref:`amdgpu-processor-table`). The specific processor is specified in the
1003 ``NT_AMD_HSA_ISA_VERSION`` note record for code object V2 (see
1004 :ref:`amdgpu-note-records-v2`) and in the ``EF_AMDGPU_MACH`` bit field of the
1005 ``e_flags`` for code object V3 to V4 (see
1006 :ref:`amdgpu-elf-header-e_flags-table-v3` and
1007 :ref:`amdgpu-elf-header-e_flags-table-v4`).
1010 The entry point is 0 as the entry points for individual kernels must be
1011 selected in order to invoke them through AQL packets.
1014 The AMDGPU backend uses the following ELF header flags:
1016 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V2
1017 :name: amdgpu-elf-header-e_flags-v2-table
1019 ===================================== ===== =============================
1020 Name Value Description
1021 ===================================== ===== =============================
1022 ``EF_AMDGPU_FEATURE_XNACK_V2`` 0x01 Indicates if the ``xnack``
1024 enabled for all code
1025 contained in the code object.
1027 does not support the
1032 :ref:`amdgpu-target-features`.
1033 ``EF_AMDGPU_FEATURE_TRAP_HANDLER_V2`` 0x02 Indicates if the trap
1034 handler is enabled for all
1035 code contained in the code
1036 object. If the processor
1037 does not support a trap
1038 handler then must be 0.
1040 :ref:`amdgpu-target-features`.
1041 ===================================== ===== =============================
1043 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V3
1044 :name: amdgpu-elf-header-e_flags-table-v3
1046 ================================= ===== =============================
1047 Name Value Description
1048 ================================= ===== =============================
1049 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection
1051 ``EF_AMDGPU_MACH_xxx`` values
1053 :ref:`amdgpu-ef-amdgpu-mach-table`.
1054 ``EF_AMDGPU_FEATURE_XNACK_V3`` 0x100 Indicates if the ``xnack``
1056 enabled for all code
1057 contained in the code object.
1059 does not support the
1064 :ref:`amdgpu-target-features`.
1065 ``EF_AMDGPU_FEATURE_SRAMECC_V3`` 0x200 Indicates if the ``sramecc``
1067 enabled for all code
1068 contained in the code object.
1070 does not support the
1075 :ref:`amdgpu-target-features`.
1076 ================================= ===== =============================
1078 .. table:: AMDGPU ELF Header ``e_flags`` for Code Object V4
1079 :name: amdgpu-elf-header-e_flags-table-v4
1081 ============================================ ===== ===================================
1082 Name Value Description
1083 ============================================ ===== ===================================
1084 ``EF_AMDGPU_MACH`` 0x0ff AMDGPU processor selection
1086 ``EF_AMDGPU_MACH_xxx`` values
1088 :ref:`amdgpu-ef-amdgpu-mach-table`.
1089 ``EF_AMDGPU_FEATURE_XNACK_V4`` 0x300 XNACK selection mask for
1090 ``EF_AMDGPU_FEATURE_XNACK_*_V4``
1092 ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4`` 0x000 XNACK unsupported.
1093 ``EF_AMDGPU_FEATURE_XNACK_ANY_V4`` 0x100 XNACK can have any value.
1094 ``EF_AMDGPU_FEATURE_XNACK_OFF_V4`` 0x200 XNACK disabled.
1095 ``EF_AMDGPU_FEATURE_XNACK_ON_V4`` 0x300 XNACK enabled.
1096 ``EF_AMDGPU_FEATURE_SRAMECC_V4`` 0xc00 SRAMECC selection mask for
1097 ``EF_AMDGPU_FEATURE_SRAMECC_*_V4``
1099 ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x000 SRAMECC unsupported.
1100 ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4`` 0x400 SRAMECC can have any value.
1101 ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4`` 0x800 SRAMECC disabled.
1102 ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4`` 0xc00 SRAMECC enabled.
1103 ============================================ ===== ===================================
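The code object V4 ``e_flags`` word packs the processor selection, the XNACK setting, and the SRAMECC setting into separate bit fields. A minimal Python sketch of unpacking them follows, assuming ``e_flags`` has already been read from the ELF header as an integer. The helper name ``decode_e_flags_v4`` and the short setting strings are hypothetical; the masks and values are copied from the table above.

```python
EF_AMDGPU_MACH = 0x0FF
EF_AMDGPU_FEATURE_XNACK_V4 = 0x300
EF_AMDGPU_FEATURE_SRAMECC_V4 = 0xC00

# Selection values within each mask, as listed in the V4 e_flags table.
XNACK_V4 = {0x000: "unsupported", 0x100: "any", 0x200: "off", 0x300: "on"}
SRAMECC_V4 = {0x000: "unsupported", 0x400: "any", 0x800: "off", 0xC00: "on"}


def decode_e_flags_v4(e_flags):
    """Split a code object V4 e_flags word into its three fields."""
    return {
        "mach": e_flags & EF_AMDGPU_MACH,
        "xnack": XNACK_V4[e_flags & EF_AMDGPU_FEATURE_XNACK_V4],
        "sramecc": SRAMECC_V4[e_flags & EF_AMDGPU_FEATURE_SRAMECC_V4],
    }
```

For example, ``decode_e_flags_v4(0xB3F)`` yields a ``mach`` value of ``0x3f`` with XNACK on and SRAMECC off.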
1105 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
1106 :name: amdgpu-ef-amdgpu-mach-table
1108 ==================================== ========== =============================
1109 Name Value Description (see
1110 :ref:`amdgpu-processor-table`)
1111 ==================================== ========== =============================
1112 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
1113 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
1114 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
1115 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
1116 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
1117 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
1118 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
1119 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
1120 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
1121 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
1122 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
1123 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
1124 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
1125 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
1126 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
1127 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
1128 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
1129 *reserved* 0x011 - Reserved for ``r600``
1130 0x01f architecture processors.
1131 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
1132 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
1133 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
1134 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
1135 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
1136 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
1137 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
1138 *reserved* 0x027 Reserved.
1139 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
1140 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
1141 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
1142 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
1143 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
1144 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
1145 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
1146 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
1147 ``EF_AMDGPU_MACH_AMDGCN_GFX908`` 0x030 ``gfx908``
1148 ``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909``
1149 ``EF_AMDGPU_MACH_AMDGCN_GFX90C`` 0x032 ``gfx90c``
1150 ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033 ``gfx1010``
1151 ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034 ``gfx1011``
1152 ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035 ``gfx1012``
1153 ``EF_AMDGPU_MACH_AMDGCN_GFX1030`` 0x036 ``gfx1030``
1154 ``EF_AMDGPU_MACH_AMDGCN_GFX1031`` 0x037 ``gfx1031``
1155 ``EF_AMDGPU_MACH_AMDGCN_GFX1032`` 0x038 ``gfx1032``
1156 ``EF_AMDGPU_MACH_AMDGCN_GFX1033`` 0x039 ``gfx1033``
1157 ``EF_AMDGPU_MACH_AMDGCN_GFX602`` 0x03a ``gfx602``
1158 ``EF_AMDGPU_MACH_AMDGCN_GFX705`` 0x03b ``gfx705``
1159 ``EF_AMDGPU_MACH_AMDGCN_GFX805`` 0x03c ``gfx805``
1160 ``EF_AMDGPU_MACH_AMDGCN_GFX1035`` 0x03d ``gfx1035``
1161 ``EF_AMDGPU_MACH_AMDGCN_GFX1034`` 0x03e ``gfx1034``
1162 ``EF_AMDGPU_MACH_AMDGCN_GFX90A`` 0x03f ``gfx90a``
1163 *reserved* 0x040 Reserved.
1164 *reserved* 0x041 Reserved.
1165 ``EF_AMDGPU_MACH_AMDGCN_GFX1013`` 0x042 ``gfx1013``
1166 *reserved* 0x043 Reserved.
1167 *reserved* 0x044 Reserved.
1168 *reserved* 0x045 Reserved.
1169 ==================================== ========== =============================
1174 An AMDGPU target ELF code object has the standard ELF sections which include:
1176 .. table:: AMDGPU ELF Sections
1177 :name: amdgpu-elf-sections-table
1179 ================== ================ =================================
1180 Name Type Attributes
1181 ================== ================ =================================
1182 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1183 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1184 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
1185 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
1186 ``.dynstr`` ``SHT_STRTAB`` ``SHF_ALLOC``
1187 ``.dynsym`` ``SHT_DYNSYM`` ``SHF_ALLOC``
1188 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
1189 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
1190 ``.note`` ``SHT_NOTE`` *none*
1191 ``.rela``\ *name* ``SHT_RELA`` *none*
1192 ``.rela.dyn`` ``SHT_RELA`` *none*
1193 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
1194 ``.shstrtab`` ``SHT_STRTAB`` *none*
1195 ``.strtab`` ``SHT_STRTAB`` *none*
1196 ``.symtab`` ``SHT_SYMTAB`` *none*
1197 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
1198 ================== ================ =================================
1200 These sections have their standard meanings (see [ELF]_) and are only generated
1204 The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for
1205 information on the DWARF produced by the AMDGPU backend.
1207 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
1208 The standard sections used by a dynamic loader.
1211 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
1214 ``.rela``\ *name*, ``.rela.dyn``
1215 For relocatable code objects, *name* is the name of the section to which the
1216 relocation records apply. For example, ``.rela.text`` is the section name for
1217 relocation records associated with the ``.text`` section.
1219 For linked shared code objects, ``.rela.dyn`` contains all the relocation
1220 records from each of the relocatable code object's ``.rela``\ *name* sections.
1222 See :ref:`amdgpu-relocation-records` for the relocation records supported by
1226 The executable machine code for the kernels and functions they call. Generated
1227 as position independent code. See :ref:`amdgpu-code-conventions` for
1228 information on conventions used in the ISA generation.
1230 .. _amdgpu-note-records:
1235 The AMDGPU backend code object contains ELF note records in the ``.note``
1236 section. The set of generated notes and their semantics depend on the code
1237 object version; see :ref:`amdgpu-note-records-v2` and
1238 :ref:`amdgpu-note-records-v3-v4`.
1240 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding
1241 must be generated after the ``name`` field to ensure the ``desc`` field is 4
1242 byte aligned. In addition, minimal zero-byte padding must be generated to
1243 ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
1244 field of the ``.note`` section must be at least 4 to indicate at least 4 byte alignment.
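The padding rules can be sketched in Python as follows. This is an illustration of the generic ELF note layout (three 32-bit words followed by the padded ``name`` and ``desc`` fields), not code taken from LLVM; ``align4`` and ``pack_note`` are hypothetical helper names, and little-endian packing is assumed because all AMDGPU targets use ``ELFDATA2LSB``.

```python
import struct


def align4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def pack_note(name, note_type, desc):
    """Build one ELF note record with minimal zero-byte padding.

    namesz counts the terminating NUL; namesz and descsz record the
    unpadded lengths, as in the generic ELF note format.
    """
    name_z = name + b"\x00"
    header = struct.pack("<III", len(name_z), len(desc), note_type)
    return (header
            + name_z.ljust(align4(len(name_z)), b"\x00")   # desc starts 4-byte aligned
            + desc.ljust(align4(len(desc)), b"\x00"))      # record ends 4-byte aligned
```

With the vendor name ``AMDGPU`` the ``name`` field is 7 bytes including the NUL, so one byte of padding is inserted before ``desc``.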
1247 .. _amdgpu-note-records-v2:
1249 Code Object V2 Note Records
1250 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
1253 Code object V2 is not the default code object version emitted by
1254 this version of LLVM.
1256 The AMDGPU backend code object uses the following ELF note records in the
1257 ``.note`` section when compiling for code object V2.
1259 The note record vendor field is "AMD".
1261 Additional note records may be present, but any which are not documented here
1262 are deprecated and should not be used.
1264 .. table:: AMDGPU Code Object V2 ELF Note Records
1265 :name: amdgpu-elf-note-records-v2-table
1267 ===== ===================================== ======================================
1268 Name Type Description
1269 ===== ===================================== ======================================
1270 "AMD" ``NT_AMD_HSA_CODE_OBJECT_VERSION`` Code object version.
1271 "AMD" ``NT_AMD_HSA_HSAIL`` HSAIL properties generated by the HSAIL
1272 Finalizer and not the LLVM compiler.
1273 "AMD" ``NT_AMD_HSA_ISA_VERSION`` Target ISA version.
1274 "AMD" ``NT_AMD_HSA_METADATA`` Metadata null terminated string in
1275 YAML [YAML]_ textual format.
1276 "AMD" ``NT_AMD_HSA_ISA_NAME`` Target ISA name.
1277 ===== ===================================== ======================================
1281 .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
1282 :name: amdgpu-elf-note-record-enumeration-values-v2-table
1284 ===================================== =====
1286 ===================================== =====
1287 ``NT_AMD_HSA_CODE_OBJECT_VERSION`` 1
1288 ``NT_AMD_HSA_HSAIL`` 2
1289 ``NT_AMD_HSA_ISA_VERSION`` 3
1291 ``NT_AMD_HSA_METADATA`` 10
1292 ``NT_AMD_HSA_ISA_NAME`` 11
1293 ===================================== =====
1295 ``NT_AMD_HSA_CODE_OBJECT_VERSION``
1296 Specifies the code object version number. The description field has the
1301 struct amdgpu_hsa_note_code_object_version_s {
1302 uint32_t major_version;
1303 uint32_t minor_version;
1306 The ``major_version`` has a value less than or equal to 2.
1308 ``NT_AMD_HSA_HSAIL``
1309 Specifies the HSAIL properties used by the HSAIL Finalizer. The description
1310 field has the following layout:
1314 struct amdgpu_hsa_note_hsail_s {
1315 uint32_t hsail_major_version;
1316 uint32_t hsail_minor_version;
1318 uint8_t machine_model;
1319 uint8_t default_float_round;
1322 ``NT_AMD_HSA_ISA_VERSION``
1323 Specifies the target ISA version. The description field has the following layout:
1327 struct amdgpu_hsa_note_isa_s {
1328 uint16_t vendor_name_size;
1329 uint16_t architecture_name_size;
1333 char vendor_and_architecture_name[1];
1336 ``vendor_name_size`` and ``architecture_name_size`` are the lengths of the
1337 vendor and architecture names respectively, including the NUL character.
1339 ``vendor_and_architecture_name`` contains the NUL terminated string for the
1340 vendor, immediately followed by the NUL terminated string for the
1343 This note record is used by the HSA runtime loader.
1345 Code object V2 only supports a limited number of processors and has fixed
1346 settings for target features. See
1347 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a list of
1348 processors and the corresponding target ID. In the table the note record ISA
1349 name is a concatenation of the vendor name, architecture name, major, minor,
1350 and stepping separated by a ":".
1352 The target ID column shows the processor name and fixed target features used
1353 by the LLVM compiler. The LLVM compiler does not generate an
1354 ``NT_AMD_HSA_HSAIL`` note record.
1356 A code object generated by the Finalizer also uses code object V2 and always
1357 generates an ``NT_AMD_HSA_HSAIL`` note record. The processor name and
1358 ``sramecc`` target feature are as shown in
1359 :ref:`amdgpu-elf-note-record-supported_processors-v2-table` but the ``xnack``
1360 target feature is specified by the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags``
1363 ``NT_AMD_HSA_ISA_NAME``
1364 Specifies the target ISA name as a non-NUL terminated string.
1366 This note record is not used by the HSA runtime loader.
1368 See the ``NT_AMD_HSA_ISA_VERSION`` note record description of the code object
1369 V2's limited support of processors and fixed settings for target features.
1371 See :ref:`amdgpu-elf-note-record-supported_processors-v2-table` for a mapping
1372 from the string to the corresponding target ID. If the ``xnack`` target
1373 feature is supported and enabled, the string produced by the LLVM compiler
1374 may have ``+xnack`` appended. The Finalizer did not do the appending and
1375 instead used the ``EF_AMDGPU_FEATURE_XNACK_V2`` ``e_flags`` bit.
1377 ``NT_AMD_HSA_METADATA``
1378 Specifies extensible metadata associated with the code objects executed on HSA
1379 [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). It is required when the
1380 target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
1381 :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code object
1384 .. table:: AMDGPU Code Object V2 Supported Processors and Fixed Target Feature Settings
1385 :name: amdgpu-elf-note-record-supported_processors-v2-table
1387 ===================== ==========================
1388 Note Record ISA Name Target ID
1389 ===================== ==========================
1390 ``AMD:AMDGPU:6:0:0`` ``gfx600``
1391 ``AMD:AMDGPU:6:0:1`` ``gfx601``
1392 ``AMD:AMDGPU:6:0:2`` ``gfx602``
1393 ``AMD:AMDGPU:7:0:0`` ``gfx700``
1394 ``AMD:AMDGPU:7:0:1`` ``gfx701``
1395 ``AMD:AMDGPU:7:0:2`` ``gfx702``
1396 ``AMD:AMDGPU:7:0:3`` ``gfx703``
1397 ``AMD:AMDGPU:7:0:4`` ``gfx704``
1398 ``AMD:AMDGPU:7:0:5`` ``gfx705``
1399 ``AMD:AMDGPU:8:0:0`` ``gfx802``
1400 ``AMD:AMDGPU:8:0:1`` ``gfx801:xnack+``
1401 ``AMD:AMDGPU:8:0:2`` ``gfx802``
1402 ``AMD:AMDGPU:8:0:3`` ``gfx803``
1403 ``AMD:AMDGPU:8:0:4`` ``gfx803``
1404 ``AMD:AMDGPU:8:0:5`` ``gfx805``
1405 ``AMD:AMDGPU:8:1:0`` ``gfx810:xnack+``
1406 ``AMD:AMDGPU:9:0:0`` ``gfx900:xnack-``
1407 ``AMD:AMDGPU:9:0:1`` ``gfx900:xnack+``
1408 ``AMD:AMDGPU:9:0:2`` ``gfx902:xnack-``
1409 ``AMD:AMDGPU:9:0:3`` ``gfx902:xnack+``
1410 ``AMD:AMDGPU:9:0:4`` ``gfx904:xnack-``
1411 ``AMD:AMDGPU:9:0:5`` ``gfx904:xnack+``
1412 ``AMD:AMDGPU:9:0:6`` ``gfx906:sramecc-:xnack-``
1413 ``AMD:AMDGPU:9:0:7`` ``gfx906:sramecc-:xnack+``
1414 ``AMD:AMDGPU:9:0:12`` ``gfx90c:xnack-``
1415 ===================== ==========================
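To illustrate how the note record ISA name relates to the target ID, the Python sketch below splits a name of the form ``vendor:architecture:major:minor:stepping`` and looks the version triple up in a small mapping. The mapping holds only a few rows copied from the table above and is deliberately incomplete; ``parse_isa_name``, ``target_id``, and ``TARGET_IDS`` are hypothetical names.

```python
def parse_isa_name(isa_name):
    """Split 'AMD:AMDGPU:9:0:6' into (vendor, architecture, major, minor, stepping)."""
    vendor, architecture, major, minor, stepping = isa_name.split(":")
    return vendor, architecture, int(major), int(minor), int(stepping)


# A few rows copied from the supported processors table; not the complete list.
TARGET_IDS = {
    (9, 0, 0): "gfx900:xnack-",
    (9, 0, 1): "gfx900:xnack+",
    (9, 0, 6): "gfx906:sramecc-:xnack-",
    (9, 0, 12): "gfx90c:xnack-",
}


def target_id(isa_name):
    """Return the target ID for an ISA name, or None if it is not in TARGET_IDS."""
    _, _, major, minor, stepping = parse_isa_name(isa_name)
    return TARGET_IDS.get((major, minor, stepping))
```

Keying the lookup on the numeric triple rather than on the raw string keeps it independent of the vendor and architecture names.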
1417 .. _amdgpu-note-records-v3-v4:
1419 Code Object V3 to V4 Note Records
1420 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1422 The AMDGPU backend code object uses the following ELF note record in the
1423 ``.note`` section when compiling for code object V3 to V4.
1425 The note record vendor field is "AMDGPU".
1427 Additional note records may be present, but any which are not documented here
1428 are deprecated and should not be used.
1430 .. table:: AMDGPU Code Object V3 to V4 ELF Note Records
1431 :name: amdgpu-elf-note-records-table-v3-v4
1433 ======== ============================== ======================================
1434 Name Type Description
1435 ======== ============================== ======================================
1436 "AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_
1438 ======== ============================== ======================================
1442 .. table:: AMDGPU Code Object V3 to V4 ELF Note Record Enumeration Values
1443 :name: amdgpu-elf-note-record-enumeration-values-table-v3-v4
1445 ============================== =====
1447 ============================== =====
1449 ``NT_AMDGPU_METADATA`` 32
1450 ============================== =====
1452 ``NT_AMDGPU_METADATA``
1453 Specifies extensible metadata associated with an AMDGPU code object. It is
1454 encoded as a map in the Message Pack [MsgPack]_ binary data format. See
1455 :ref:`amdgpu-amdhsa-code-object-metadata-v3` and
1456 :ref:`amdgpu-amdhsa-code-object-metadata-v4` for the map keys defined for the
1464 Symbols include the following:
1466 .. table:: AMDGPU ELF Symbols
1467 :name: amdgpu-elf-symbols-table
1469 ===================== ================== ================ ==================
1470 Name Type Section Description
1471 ===================== ================== ================ ==================
1472 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
1475 *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
1476 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
1477 *link-name* ``STT_OBJECT`` - SHN_AMDGPU_LDS Global variable in LDS
1478 ===================== ================== ================ ==================
1481 Global variables both used and defined by the compilation unit.
1483 If the symbol is defined in the compilation unit then it is allocated in the
1484 appropriate section according to whether it has initialized data or is read-only.
1486 If the symbol is external then its section is ``SHN_UNDEF`` and the loader
1487 will resolve relocations using the definition provided by another code object
1488 or explicitly defined by the runtime.
1490 If the symbol resides in local/group memory (LDS) then its section is the
1491 special processor specific section name ``SHN_AMDGPU_LDS``, and the
1492 ``st_value`` field describes alignment requirements as it does for common
1497 Add description of linked shared object symbols. Seems undefined symbols
1498 are marked as STT_NOTYPE.
1501 Every HSA kernel has an associated kernel descriptor. It is the address of the
1502 kernel descriptor that is used in the AQL dispatch packet used to invoke the
1503 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
1504 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
1507 Every HSA kernel also has a symbol for its machine code entry point.
1509 .. _amdgpu-relocation-records:
1514 The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
1515 relocatable fields are:
1518 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
1519 alignment. These values use the same byte order as other word values in the
1520 AMDGPU architecture.
1523 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
1524 alignment. These values use the same byte order as other word values in the
1525 AMDGPU architecture.
1527 The following notations are used for specifying relocation calculations:
1530 Represents the addend used to compute the value of the relocatable field.
1533 Represents the offset into the global offset table at which the relocation
1534 entry's symbol will reside during execution.
1537 Represents the address of the global offset table.
1540 Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
1541 of the storage unit being relocated (computed using ``r_offset``).
1544 Represents the value of the symbol whose index resides in the relocation
1545 entry. Relocations not using this must specify a symbol index of
1549 Represents the base address of a loaded executable or shared object which is
1550 the difference between the ELF address and the actual load address.
1551 Relocations using this are only valid in executable or shared objects.
1553 The following relocation types are supported:
1555 .. table:: AMDGPU ELF Relocation Records
1556 :name: amdgpu-elf-relocation-records-table
1558 ========================== ======= ===== ========== ==============================
1559 Relocation Type Kind Value Field Calculation
1560 ========================== ======= ===== ========== ==============================
1561 ``R_AMDGPU_NONE`` 0 *none* *none*
1562 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF
1564 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32
1566 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A
1568 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
1569 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
1570 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A
1572 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
1573 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
1574 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
1575 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
1576 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
1578 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
1579 ``R_AMDGPU_REL16`` Static 14 ``word16`` ((S + A - P) - 4) / 4
1580 ========================== ======= ===== ========== ==============================
1582 ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
1583 the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
1585 There is no current OS loader support for 32-bit programs and so
1586 ``R_AMDGPU_ABS32`` is not used.
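The ``_LO`` and ``_HI`` calculations in the relocation table split a 64-bit result across two ``word32`` fields. The following Python sketch evaluates four of them for illustration. It assumes 64-bit two's-complement wraparound for the intermediate result, which the table does not state explicitly, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF  # assumption: intermediate results wrap to 64 bits


def r_amdgpu_abs32_lo(S, A):
    """(S + A) & 0xFFFFFFFF"""
    return (S + A) & MASK32


def r_amdgpu_abs32_hi(S, A):
    """(S + A) >> 32"""
    return ((S + A) & MASK64) >> 32


def r_amdgpu_rel32_lo(S, A, P):
    """(S + A - P) & 0xFFFFFFFF"""
    return (S + A - P) & MASK32


def r_amdgpu_rel32_hi(S, A, P):
    """(S + A - P) >> 32"""
    return ((S + A - P) & MASK64) >> 32
```

Recombining the two halves as ``(hi << 32) | lo`` gives back the full 64-bit value, including for a negative PC-relative displacement.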
1588 .. _amdgpu-loaded-code-object-path-uniform-resource-identifier:
1590 Loaded Code Object Path Uniform Resource Identifier (URI)
1591 ---------------------------------------------------------
1593 The AMD GPU code object loader represents the path of the ELF shared object from
1594 which the code object was loaded as a textual Uniform Resource Identifier (URI).
1595 Note that the code object is the in memory loaded relocated form of the ELF
1596 shared object. Multiple code objects may be loaded at different memory
1597 addresses in the same process from the same ELF shared object.
1599 The loaded code object path URI syntax is defined by the following BNF:
1603 code_object_uri ::== file_uri | memory_uri
1604 file_uri ::== "file://" file_path [ range_specifier ]
1605 memory_uri ::== "memory://" process_id range_specifier
1606 range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
1607 file_path ::== URI_ENCODED_OS_FILE_PATH
1608 process_id ::== DECIMAL_NUMBER
1609 number ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
1612 Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X",
1613 and octal values by "0".
1616 Is the file's path specified as a URI encoded UTF-8 string. In URI encoding,
1617 every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is
1618 encoded as two uppercase hexadecimal digits preceded by "%". Directories in
1619 the path are separated by "/".
1622 Is a 0-based byte offset to the start of the code object. For a file URI, it
1623 is from the start of the file specified by the ``file_path``, and if omitted
1624 defaults to 0. For a memory URI, it is the memory address and is required.
1627 Is the number of bytes in the code object. For a file URI, if omitted it
1628 defaults to the size of the file. It is required for a memory URI.
1631 Is the identity of the process owning the memory. For Linux it is the C
1632 unsigned integral decimal literal for the process ID (PID).
1638 file:///dir1/dir2/file1
1639 file:///dir3/dir4/file2#offset=0x2000&size=3000
1640 memory://1234#offset=0x20000&size=3000
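A small Python sketch of a parser for this URI syntax is shown below for illustration. It is not the loader's implementation; ``parse_c_int``, ``encode_file_path``, and ``parse_code_object_uri`` are hypothetical names, and representing an omitted file size as ``None`` (meaning "the size of the file") is an assumption of the sketch. Only the standard library ``urllib.parse.quote`` and ``unquote`` functions are used.

```python
from urllib.parse import quote, unquote


def parse_c_int(text):
    """Parse a C integral literal: 0x/0X is hexadecimal, a leading 0 is octal."""
    if text[:2] in ("0x", "0X"):
        return int(text[2:], 16)
    if len(text) > 1 and text[0] == "0":
        return int(text[1:], 8)
    return int(text, 10)


def encode_file_path(path):
    """Percent-encode every character outside [a-zA-Z0-9/_.~-] as UTF-8 bytes."""
    return quote(path, safe="/")


def parse_code_object_uri(uri):
    scheme, sep, rest = uri.partition("://")
    if not sep or scheme not in ("file", "memory"):
        raise ValueError("expected a file:// or memory:// URI")
    # The range specifier, if present, starts at the first "#" or "?".
    marks = [i for i in (rest.find("#"), rest.find("?")) if i != -1]
    if marks:
        target, spec = rest[:min(marks)], rest[min(marks) + 1:]
    else:
        target, spec = rest, None
    if scheme == "file":
        # File URI defaults: offset 0; size None stands for the whole file.
        result = {"scheme": scheme, "path": unquote(target),
                  "offset": 0, "size": None}
    else:
        if spec is None:
            raise ValueError("a memory URI requires offset and size")
        result = {"scheme": scheme, "pid": int(target, 10),
                  "offset": None, "size": None}
    if spec is not None:
        offset_part, _, size_part = spec.partition("&")
        if not (offset_part.startswith("offset=")
                and size_part.startswith("size=")):
            raise ValueError("malformed range specifier")
        result["offset"] = parse_c_int(offset_part[len("offset="):])
        result["size"] = parse_c_int(size_part[len("size="):])
    return result
```

``int(text, 0)`` is avoided on purpose: Python rejects C-style octal literals such as ``0755``, so the prefix is handled by hand.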
1642 .. _amdgpu-dwarf-debug-information:
1644 DWARF Debug Information
1645 =======================
1649 This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that
1650 is not currently fully implemented and is subject to change.
1652 AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see
1653 :ref:`amdgpu-elf-code-object`) which contain information that maps the code
1654 object executable code and data to the source language constructs. It can be
1655 used by tools such as debuggers and profilers. It uses features defined in
1656 :doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in
1657 DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension.
1659 This section defines the AMDGPU target architecture specific DWARF mappings.
1661 .. _amdgpu-dwarf-register-identifier:
1666 This section defines the AMDGPU target architecture register numbers used in
1667 DWARF operation expressions (see DWARF Version 5 section 2.5 and
1668 :ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information
1669 instructions (see DWARF Version 5 section 6.4 and
1670 :ref:`amdgpu-dwarf-call-frame-information`).
1672 A single code object can contain code for kernels that have different wavefront
1673 sizes. The vector registers and some scalar registers are based on the wavefront
1674 size. AMDGPU defines distinct DWARF registers for each wavefront size. This
1675 simplifies the consumer of the DWARF so that each register has a fixed size,
1676 rather than being dynamic according to the wavefront size mode. Similarly,
1677 distinct DWARF registers are defined for those registers that vary in size
1678 according to the process address size. This allows a consumer to treat a
1679 specific AMDGPU processor as a single architecture regardless of how it is
1680 configured at run time. The compiler explicitly specifies the DWARF registers
1681 that match the mode in which the code it is generating will be executed.
1683 DWARF registers are encoded as numbers, which are mapped to architecture
1684 registers. The mapping for AMDGPU is defined in
1685 :ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same
1688 .. table:: AMDGPU DWARF Register Mapping
1689 :name: amdgpu-dwarf-register-mapping-table
1691 ============== ================= ======== ==================================
1692 DWARF Register AMDGPU Register Bit Size Description
1693 ============== ================= ======== ==================================
1694 0 PC_32 32 Program Counter (PC) when
1695 executing in a 32-bit process
1696 address space. Used in the CFI to
1697 describe the PC of the calling
1699 1 EXEC_MASK_32 32 Execution Mask Register when
1700 executing in wavefront 32 mode.
1701 2-15 *Reserved* *Reserved for highly accessed
1702 registers using DWARF shortcut.*
1703 16 PC_64 64 Program Counter (PC) when
1704 executing in a 64-bit process
1705 address space. Used in the CFI to
1706 describe the PC of the calling
1708 17 EXEC_MASK_64 64 Execution Mask Register when
1709 executing in wavefront 64 mode.
1710 18-31 *Reserved* *Reserved for highly accessed
1711 registers using DWARF shortcut.*
1712 32-95 SGPR0-SGPR63 32 Scalar General Purpose
1714 96-127 *Reserved* *Reserved for frequently accessed
1715 registers using DWARF 1-byte ULEB.*
1716 128 STATUS 32 Status Register.
1717 129-511 *Reserved* *Reserved for future Scalar
1718 Architectural Registers.*
1719 512 VCC_32 32 Vector Condition Code Register
1720 when executing in wavefront 32
1722 513-767 *Reserved* *Reserved for future Vector
1723 Architectural Registers when
1724 executing in wavefront 32 mode.*
1725 768 VCC_64 64 Vector Condition Code Register
1726 when executing in wavefront 64
1728 769-1023 *Reserved* *Reserved for future Vector
1729 Architectural Registers when
1730 executing in wavefront 64 mode.*
1731 1024-1087 *Reserved* *Reserved for padding.*
1732 1088-1129 SGPR64-SGPR105 32 Scalar General Purpose Registers.
1733 1130-1535 *Reserved* *Reserved for future Scalar
1734 General Purpose Registers.*
1735 1536-1791 VGPR0-VGPR255 32*32 Vector General Purpose Registers
1736 when executing in wavefront 32
1738 1792-2047 *Reserved* *Reserved for future Vector
1739 General Purpose Registers when
1740 executing in wavefront 32 mode.*
1741 2048-2303 AGPR0-AGPR255 32*32 Vector Accumulation Registers
1742 when executing in wavefront 32
1744 2304-2559 *Reserved* *Reserved for future Vector
1745 Accumulation Registers when
1746 executing in wavefront 32 mode.*
1747 2560-2815 VGPR0-VGPR255 64*32 Vector General Purpose Registers
1748 when executing in wavefront 64
1750 2816-3071 *Reserved* *Reserved for future Vector
1751 General Purpose Registers when
1752 executing in wavefront 64 mode.*
1753 3072-3327 AGPR0-AGPR255 64*32 Vector Accumulation Registers
1754 when executing in wavefront 64
1756 3328-3583 *Reserved* *Reserved for future Vector
1757 Accumulation Registers when
1758 executing in wavefront 64 mode.*
1759 ============== ================= ======== ==================================
1761 The vector registers are represented as the full size for the wavefront. They
1762 are organized as consecutive dwords (32 bits), one per lane, with the dword at
1763 the least significant bit position corresponding to lane 0 and so forth. DWARF
1764 location expressions involving the ``DW_OP_LLVM_offset`` and
1765 ``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector
1766 register corresponding to the lane that is executing the current thread of
1767 execution in languages that are implemented using a SIMD or SIMT execution
1770 If the wavefront size is 32 lanes then the wavefront 32 mode register
1771 definitions are used. If the wavefront size is 64 lanes then the wavefront 64
1772 mode register definitions are used. Some AMDGPU targets support executing in
1773 both wavefront 32 and wavefront 64 mode. The register definitions corresponding
1774 to the wavefront mode of the generated code will be used.
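
As a minimal sketch of how a consumer might apply these rules, the following
Python helpers pick the DWARF register number of a VGPR or AGPR from the
wavefront size, and locate the dword of one lane inside the register contents.
The base numbers are copied from the register mapping table above. The function
and constant names are hypothetical and are not part of LLVM.

```python
# Base DWARF register numbers, copied from the register mapping table above.
# Keys are (register kind, wavefront size).
VECTOR_REGISTER_BASE = {
    ("vgpr", 32): 1536,
    ("agpr", 32): 2048,
    ("vgpr", 64): 2560,
    ("agpr", 64): 3072,
}


def dwarf_vector_register(kind, index, wavefront_size):
    """Return the DWARF register number for VGPR<index> or AGPR<index>."""
    if not 0 <= index <= 255:
        raise ValueError("vector register index must be in 0..255")
    try:
        base = VECTOR_REGISTER_BASE[(kind, wavefront_size)]
    except KeyError:
        raise ValueError("unknown register kind or wavefront size") from None
    return base + index


def lane_byte_range(lane, wavefront_size):
    """Return the (start, end) byte offsets of one lane's dword.

    Lane 0 occupies the least significant dword, so with least significant
    bytes first the dword of lane N starts at byte offset N * 4.
    """
    if not 0 <= lane < wavefront_size:
        raise ValueError("lane is outside the wavefront")
    return lane * 4, lane * 4 + 4
```

For example, ``dwarf_vector_register("vgpr", 3, 64)`` selects the wavefront 64
definition of VGPR3, and ``lane_byte_range(5, 64)`` gives the bytes that hold
lane 5 of that register.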
1776 If code is generated to execute in a 32-bit process address space, then the
1777 32-bit process address space register definitions are used. If code is generated
1778 to execute in a 64-bit process address space, then the 64-bit process address
1779 space register definitions are used. The ``amdgcn`` target only supports the
1780 64-bit process address space.
1782 .. _amdgpu-dwarf-address-class-identifier:
1784 Address Class Identifier
1785 ------------------------
1787 The DWARF address class represents the source language memory space. See DWARF
1788 Version 5 section 2.12 which is updated by the *DWARF Extensions For
1789 Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.
1791 The DWARF address class mapping used for AMDGPU is defined in
1792 :ref:`amdgpu-dwarf-address-class-mapping-table`.

.. table:: AMDGPU DWARF Address Class Mapping
   :name: amdgpu-dwarf-address-class-mapping-table

   ========================= ====== =================
   Address Class Name        Value  Address Space
   ========================= ====== =================
   ``DW_ADDR_none``          0x0000 Generic (Flat)
   ``DW_ADDR_LLVM_global``   0x0001 Global
   ``DW_ADDR_LLVM_constant`` 0x0002 Global
   ``DW_ADDR_LLVM_group``    0x0003 Local (group/LDS)
   ``DW_ADDR_LLVM_private``  0x0004 Private (Scratch)
   ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS)
   ========================= ====== =================

1810 The DWARF address class values defined in the *DWARF Extensions For
1811 Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses` are used.
In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. It is
provided for the AMD extension that accesses the hardware GDS memory, which is
scratchpad memory allocated per device.
1817 For AMDGPU if no ``DW_AT_address_class`` attribute is present, then the default
1818 address class of ``DW_ADDR_none`` is used.
See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU
mapping of DWARF address classes to DWARF address spaces, including address size
and NULL value.
1824 .. _amdgpu-dwarf-address-space-identifier:
1826 Address Space Identifier
1827 ------------------------
1829 DWARF address spaces correspond to target architecture specific linear
1830 addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions
1831 For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`.
1833 The DWARF address space mapping used for AMDGPU is defined in
1834 :ref:`amdgpu-dwarf-address-space-mapping-table`.
1836 .. table:: AMDGPU DWARF Address Space Mapping
1837 :name: amdgpu-dwarf-address-space-mapping-table
1839 ======================================= ===== ======= ======== ================= =======================
1841 --------------------------------------- ----- ---------------- ----------------- -----------------------
1842 Address Space Name Value Address Bit Size Address Space
1843 --------------------------------------- ----- ------- -------- ----------------- -----------------------
1848 ======================================= ===== ======= ======== ================= =======================
1849 ``DW_ASPACE_none`` 0x00 64 32 Global *default address space*
1850 ``DW_ASPACE_AMDGPU_generic`` 0x01 64 32 Generic (Flat)
1851 ``DW_ASPACE_AMDGPU_region`` 0x02 32 32 Region (GDS)
1852 ``DW_ASPACE_AMDGPU_local`` 0x03 32 32 Local (group/LDS)
1854 ``DW_ASPACE_AMDGPU_private_lane`` 0x05 32 32 Private (Scratch) *focused lane*
1855 ``DW_ASPACE_AMDGPU_private_wave`` 0x06 32 32 Private (Scratch) *unswizzled wavefront*
1856 ======================================= ===== ======= ======== ================= =======================
1858 See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces
1859 including address size and NULL value.
1861 The ``DW_ASPACE_none`` address space is the default target architecture address
1862 space used in DWARF operations that do not specify an address space. It
1863 therefore has to map to the global address space so that the ``DW_OP_addr*`` and
1864 related operations can refer to addresses in the program code.
1866 The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to
1867 specify the flat address space. If the address corresponds to an address in the
1868 local address space, then it corresponds to the wavefront that is executing the
1869 focused thread of execution. If the address corresponds to an address in the
1870 private address space, then it corresponds to the lane that is executing the
1871 focused thread of execution for languages that are implemented using a SIMD or
1872 SIMT execution model.
1876 CUDA-like languages such as HIP that do not have address spaces in the
1877 language type system, but do allow variables to be allocated in different
1878 address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic``
1879 address space in the DWARF expression operations as the default address space
1880 is the global address space.
1882 The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to
1883 specify the local address space corresponding to the wavefront that is executing
1884 the focused thread of execution.
1886 The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions
1887 to specify the private address space corresponding to the lane that is executing
1888 the focused thread of execution for languages that are implemented using a SIMD
1889 or SIMT execution model.
1891 The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions
1892 to specify the unswizzled private address space corresponding to the wavefront
1893 that is executing the focused thread of execution. The wavefront view of private
1894 memory is the per wavefront unswizzled backing memory layout defined in
1895 :ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first
1896 location for the backing memory of the wavefront (namely the address is not
1897 offset by ``wavefront-scratch-base``). The following formula can be used to
1898 convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a
1899 ``DW_ASPACE_AMDGPU_private_wave`` address:
1903 private-address-wavefront =
1904 ((private-address-lane / 4) * wavefront-size * 4) +
1905 (wavefront-lane-id * 4) + (private-address-lane % 4)
If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start
of the dwords for each lane starting with lane 0 is required, then this
simplifies to:
1913 private-address-wavefront =
1914 private-address-lane * wavefront-size
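
The two formulas above can be sketched in Python as follows. This is only an
illustration of the arithmetic; the function names are hypothetical, and all
addresses are treated as plain byte offsets.

```python
def private_lane_to_wave(private_address_lane, wavefront_lane_id, wavefront_size):
    """Convert a private lane address to a private wavefront address.

    The unswizzled backing memory interleaves the lanes dword by dword, so
    dword D of lane L is at ((D * wavefront_size) + L) * 4, and the byte
    offset within the dword is preserved.
    """
    dword_index, byte_in_dword = divmod(private_address_lane, 4)
    return (dword_index * wavefront_size * 4) + (wavefront_lane_id * 4) + byte_in_dword


def private_lane_to_wave_base(private_address_lane, wavefront_size):
    """Simplified form: start of the dwords for all lanes, beginning at lane 0.

    Only valid for a dword aligned private lane address.
    """
    if private_address_lane % 4 != 0:
        raise ValueError("private lane address must be dword aligned")
    return private_address_lane * wavefront_size
```

For a dword aligned address and lane 0 the two functions agree, which is the
case the simplified formula covers.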
1916 A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a
1917 complete spilled vector register back into a complete vector register in the
1918 CFI. The frame pointer can be a private lane address which is dword aligned,
1919 which can be shifted to multiply by the wavefront size, and then used to form a
1920 private wavefront address that gives a location for a contiguous set of dwords,
1921 one per lane, where the vector register dwords are spilled. The compiler knows
1922 the wavefront size since it generates the code. Note that the type of the
1923 address may have to be converted as the size of a
1924 ``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a
1925 ``DW_ASPACE_AMDGPU_private_wave`` address.
1927 .. _amdgpu-dwarf-lane-identifier:
DWARF lane identifiers specify a target architecture lane position for hardware
1933 that executes in a SIMD or SIMT manner, and on which a source language maps its
1934 threads of execution onto those lanes. The DWARF lane identifier is pushed by
1935 the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5
1936 section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging*
1937 section :ref:`amdgpu-dwarf-operation-expressions`.
1939 For AMDGPU, the lane identifier corresponds to the hardware lane ID of a
1940 wavefront. It is numbered from 0 to the wavefront size minus 1.
1942 Operation Expressions
1943 ---------------------
1945 DWARF expressions are used to compute program values and the locations of
1946 program objects. See DWARF Version 5 section 2.5 and
1947 :ref:`amdgpu-dwarf-operation-expressions`.
1949 DWARF location descriptions describe how to access storage which includes memory
1950 and registers. When accessing storage on AMDGPU, bytes are ordered with least
1951 significant bytes first, and bits are ordered within bytes with least
1952 significant bits first.
1954 For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe
1955 unwinding vector registers that are spilled under the execution mask to memory:
1956 the zero-single location description is the vector register, and the one-single
1957 location description is the spilled memory location description. The
1958 ``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the
1959 memory location description.
1961 In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the
1962 ``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is
1963 controlled by the execution mask. An undefined location description together
1964 with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry
1965 to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example.
1967 Debugger Information Entry Attributes
1968 -------------------------------------
1970 This section describes how certain debugger information entry attributes are
1971 used by AMDGPU. See the sections in DWARF Version 5 section 2 which are updated
1972 by *DWARF Extensions For Heterogeneous Debugging* section
1973 :ref:`amdgpu-dwarf-debugging-information-entry-attributes`.
1975 .. _amdgpu-dwarf-dw-at-llvm-lane-pc:
1977 ``DW_AT_LLVM_lane_pc``
1978 ~~~~~~~~~~~~~~~~~~~~~~
1980 For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program
1981 location of the separate lanes of a SIMT thread.
If the lane is an active lane then this will be the same as the current program
location.
If the lane is inactive, but was active on entry to the subprogram, then this is
the program location in the subprogram at which execution of the lane is
conceptually positioned.
1990 If the lane was not active on entry to the subprogram, then this will be the
1991 undefined location. A client debugger can check if the lane is part of a valid
1992 work-group by checking that the lane is in the range of the associated
1993 work-group within the grid, accounting for partial work-groups. If it is not,
1994 then the debugger can omit any information for the lane. Otherwise, the debugger
1995 may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the
1996 calling subprogram until it finds a non-undefined location. Conceptually the
1997 lane only has the call frames that it has a non-undefined
1998 ``DW_AT_LLVM_lane_pc``.
2000 The following example illustrates how the AMDGPU backend can generate a DWARF
2001 location list expression for the nested ``IF/THEN/ELSE`` structures of the
2002 following subprogram pseudo code for a target with 64 lanes per wavefront.
2024 The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the
2025 execution mask (``EXEC``) to linearize the control flow. The condition is
2026 evaluated to make a mask of the lanes for which the condition evaluates to true.
2027 First the ``THEN`` region is executed by setting the ``EXEC`` mask to the
2028 logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the
2029 ``ELSE`` region is executed by negating the ``EXEC`` mask and logical ``AND`` of
2030 the saved ``EXEC`` mask at the start of the region. After the ``IF/THEN/ELSE``
2031 region the ``EXEC`` mask is restored to the value it had at the beginning of the
2032 region. This is shown below. Other approaches are possible, but the basic
2033 concept is the same.
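
The mask arithmetic described above can be modelled with plain integers, one
bit per lane. The following Python sketch assumes a 64 lane wavefront and uses
hypothetical names. It models only the ``EXEC`` updates, not the generated
instructions.

```python
WAVEFRONT_SIZE = 64
FULL_MASK = (1 << WAVEFRONT_SIZE) - 1


def if_then_else_masks(exec_mask, condition_mask):
    """Return the EXEC values used for the THEN region, the ELSE region,
    and after the IF/THEN/ELSE region.

    Bit N of each mask is set when lane N is active.
    """
    saved_exec = exec_mask  # saved at the start of the region
    # THEN: lanes that were active and for which the condition is true.
    then_exec = saved_exec & condition_mask
    # ELSE: negate the THEN mask and AND with the saved mask.
    else_exec = (~then_exec & FULL_MASK) & saved_exec
    # After the region EXEC is restored to its value at the start.
    restored_exec = saved_exec
    return then_exec, else_exec, restored_exec
```

Lanes that were inactive on entry to the region stay inactive in both the
``THEN`` and the ``ELSE`` region, because both masks are formed by an ``AND``
with the saved mask.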
2066 To create the DWARF location list expression that defines the location
2067 description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE``
2068 pseudo instruction can be used to annotate the linearized control flow. This can
2069 be done by defining an artificial variable for the lane PC. The DWARF location
2070 list expression created for it is used as the value of the
2071 ``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry.
2073 A DWARF procedure is defined for each well nested structured control flow region
2074 which provides the conceptual lane program location for a lane if it is not
2075 active (namely it is divergent). The DWARF operation expression for each region
2076 conceptually inherits the value of the immediately enclosing region and modifies
2077 it according to the semantics of the region.
2079 For an ``IF/THEN/ELSE`` region the divergent program location is at the start of
2080 the region for the ``THEN`` region since it is executed first. For the ``ELSE``
2081 region the divergent program location is at the end of the ``IF/THEN/ELSE``
2082 region since the ``THEN`` region has completed.
2084 The lane PC artificial variable is assigned at each region transition. It uses
2085 the immediately enclosing region's DWARF procedure to compute the program
2086 location for each lane assuming they are divergent, and then modifies the result
2087 by inserting the current program location for each lane that the ``EXEC`` mask
2088 indicates is active.
2090 By having separate DWARF procedures for each region, they can be reused to
2091 define the value for any nested region. This reduces the total size of the DWARF
2092 operation expressions.
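
As an informal model of this scheme, the lane PC can be treated as a list with
one entry per lane, where ``None`` stands for the undefined location. The
Python sketch below uses hypothetical helper names. It imitates the per lane
selection that ``DW_OP_LLVM_select_bit_piece`` performs under a mask, and it is
not how the DWARF is actually evaluated.

```python
def select_by_mask(when_clear, when_set, mask):
    """For each lane take when_set[lane] if bit `lane` of mask is set,
    otherwise when_clear[lane]."""
    return [
        new if (mask >> lane) & 1 else old
        for lane, (old, new) in enumerate(zip(when_clear, when_set))
    ]


def divergent_lane_pc(enclosing, region_pc, saved_exec):
    """Region procedure: lanes active on entry to the region get the region's
    divergent program location, all other lanes inherit the enclosing value."""
    return select_by_mask(enclosing, [region_pc] * len(enclosing), saved_exec)


def lane_pc(divergent, current_pc, exec_mask):
    """Overlay the current program location on the lanes that EXEC marks
    as active."""
    return select_by_mask(divergent, [current_pc] * len(divergent), exec_mask)


# Four lanes for brevity. Lane 3 was not active on entry to the subprogram.
subprogram = [None] * 4
in_then = divergent_lane_pc(subprogram, 0x100, saved_exec=0b0111)
print(lane_pc(in_then, 0x120, exec_mask=0b0101))
```

Here lanes 0 and 2 are executing at ``0x120``, lane 1 is divergent and is
positioned at the region's divergent location ``0x100``, and lane 3 keeps the
undefined location.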
2094 The following provides an example using pseudo LLVM MIR.
2100 DEFINE_DWARF %__uint_64 = DW_TAG_base_type[
2101 DW_AT_name = "__uint64";
2102 DW_AT_byte_size = 8;
2103 DW_AT_encoding = DW_ATE_unsigned;
2105 DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[
2106 DW_AT_name = "__active_lane_pc";
2109 DW_OP_LLVM_extend 64, 64;
DW_OP_regval_type EXEC, %__uint_64;
2111 DW_OP_LLVM_select_bit_piece 64, 64;
2114 DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[
2115 DW_AT_name = "__divergent_lane_pc";
2117 DW_OP_LLVM_undefined;
2118 DW_OP_LLVM_extend 64, 64;
2121 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2122 DW_OP_call_ref %__divergent_lane_pc;
2123 DW_OP_call_ref %__active_lane_pc;
2127 DBG_VALUE %1, $noreg, %__lex_1_save_exec;
2132 DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[
2133 DW_AT_name = "__divergent_lane_pc_1_then";
2134 DW_AT_location = DIExpression[
2135 DW_OP_call_ref %__divergent_lane_pc;
2136 DW_OP_addrx &lex_1_start;
2138 DW_OP_LLVM_extend 64, 64;
2139 DW_OP_call_ref %__lex_1_save_exec;
2140 DW_OP_deref_type 64, %__uint_64;
2141 DW_OP_LLVM_select_bit_piece 64, 64;
2144 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2145 DW_OP_call_ref %__divergent_lane_pc_1_then;
2146 DW_OP_call_ref %__active_lane_pc;
DBG_VALUE %3, $noreg, %__lex_1_1_save_exec;
2155 DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[
2156 DW_AT_name = "__divergent_lane_pc_1_1_then";
2157 DW_AT_location = DIExpression[
2158 DW_OP_call_ref %__divergent_lane_pc_1_then;
2159 DW_OP_addrx &lex_1_1_start;
2161 DW_OP_LLVM_extend 64, 64;
2162 DW_OP_call_ref %__lex_1_1_save_exec;
2163 DW_OP_deref_type 64, %__uint_64;
2164 DW_OP_LLVM_select_bit_piece 64, 64;
2167 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2168 DW_OP_call_ref %__divergent_lane_pc_1_1_then;
2169 DW_OP_call_ref %__active_lane_pc;
2174 DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[
2175 DW_AT_name = "__divergent_lane_pc_1_1_else";
2176 DW_AT_location = DIExpression[
2177 DW_OP_call_ref %__divergent_lane_pc_1_then;
2178 DW_OP_addrx &lex_1_1_end;
2180 DW_OP_LLVM_extend 64, 64;
2181 DW_OP_call_ref %__lex_1_1_save_exec;
2182 DW_OP_deref_type 64, %__uint_64;
2183 DW_OP_LLVM_select_bit_piece 64, 64;
2186 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2187 DW_OP_call_ref %__divergent_lane_pc_1_1_else;
2188 DW_OP_call_ref %__active_lane_pc;
2193 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2194 DW_OP_call_ref %__divergent_lane_pc;
2195 DW_OP_call_ref %__active_lane_pc;
2200 DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[
2201 DW_AT_name = "__divergent_lane_pc_1_else";
2202 DW_AT_location = DIExpression[
2203 DW_OP_call_ref %__divergent_lane_pc;
2204 DW_OP_addrx &lex_1_end;
2206 DW_OP_LLVM_extend 64, 64;
2207 DW_OP_call_ref %__lex_1_save_exec;
2208 DW_OP_deref_type 64, %__uint_64;
2209 DW_OP_LLVM_select_bit_piece 64, 64;
2212 DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2213 DW_OP_call_ref %__divergent_lane_pc_1_else;
2214 DW_OP_call_ref %__active_lane_pc;
DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[
2220 DW_OP_call_ref %__divergent_lane_pc;
2221 DW_OP_call_ref %__active_lane_pc;
2226 The DWARF procedure ``%__active_lane_pc`` is used to update the lane pc elements
2227 that are active, with the current program location.
Artificial variables ``%__lex_1_save_exec`` and ``%__lex_1_1_save_exec`` are
created for the execution masks saved on entry to a region. Using the
``DBG_VALUE`` pseudo
2231 instruction, location list entries will be created that describe where the
2232 artificial variables are allocated at any given program location. The compiler
2233 may allocate them to registers or spill them to memory.
2235 The DWARF procedures for each region use the values of the saved execution mask
2236 artificial variables to only update the lanes that are active on entry to the
2237 region. All other lanes retain the value of the enclosing region where they were
last active. If they were not active on entry to the subprogram, then they will
have the undefined location description.
2241 Other structured control flow regions can be handled similarly. For example,
2242 loops would set the divergent program location for the region at the end of the
2243 loop. Any lanes active will be in the loop, and any lanes not active must have
2246 An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of
2247 ``IF/THEN/ELSE`` regions.
2249 The DWARF procedures can use the active lane artificial variable described in
2250 :ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual
2251 ``EXEC`` mask in order to support whole or quad wavefront mode.
2253 .. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane:
2255 ``DW_AT_LLVM_active_lane``
2256 ~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information
entry is used to specify the lanes that are conceptually active for a SIMT
thread.
2262 The execution mask may be modified to implement whole or quad wavefront mode
2263 operations. For example, all lanes may need to temporarily be made active to
2264 execute a whole wavefront operation. Such regions would save the ``EXEC`` mask,
2265 update it to enable the necessary lanes, perform the operations, and then
2266 restore the ``EXEC`` mask from the saved value. While executing the whole
2267 wavefront region, the conceptual execution mask is the saved value, not the
2270 This is handled by defining an artificial variable for the active lane mask. The
2271 active lane mask artificial variable would be the actual ``EXEC`` mask for
2272 normal regions, and the saved execution mask for regions where the mask is
temporarily updated. The location list expression created for this artificial
variable is used to define the value of the ``DW_AT_LLVM_active_lane``
attribute.
2277 ``DW_AT_LLVM_augmentation``
2278 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
2280 For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit
2281 debugger information entry has the following value for the augmentation string:
2287 The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
2288 extensions used in the DWARF of the compilation unit. The version number
2289 conforms to [SEMVER]_.
2291 Call Frame Information
2292 ----------------------
2294 DWARF Call Frame Information (CFI) describes how a consumer can virtually
2295 *unwind* call frames in a running process or core dump. See DWARF Version 5
2296 section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`.
2298 For AMDGPU, the Common Information Entry (CIE) fields have the following values:
2300 1. ``augmentation`` string contains the following null-terminated UTF-8 string:
2306 The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU
extensions used in this CIE or in the FDEs that use it. The version number
2308 conforms to [SEMVER]_.
2310 2. ``address_size`` for the ``Global`` address space is defined in
2311 :ref:`amdgpu-dwarf-address-space-identifier`.
2313 3. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector.
2315 4. ``code_alignment_factor`` is 4 bytes.
2319 Add to :ref:`amdgpu-processor-table` table.
2321 5. ``data_alignment_factor`` is 4 bytes.
2325 Add to :ref:`amdgpu-processor-table` table.
2327 6. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64``
2328 for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`.
2330 7. ``initial_instructions`` Since a subprogram X with fewer registers can be
2331 called from subprogram Y that has more allocated, X will not change any of
2332 the extra registers as it cannot access them. Therefore, the default rule
2333 for all columns is ``same value``.
2335 For AMDGPU the register number follows the numbering defined in
2336 :ref:`amdgpu-dwarf-register-identifier`.
2338 For AMDGPU the instructions are variable size. A consumer can subtract 1 from
2339 the return address to get the address of a byte within the call site
2340 instructions. See DWARF Version 5 section 6.4.4.
2345 See DWARF Version 5 section 6.1.
2347 Lookup By Name Section Header
2348 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2350 See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`.
For AMDGPU the lookup by name section header fields have the following values:
2354 ``augmentation_string_size`` (uword)
Set to the length of the ``augmentation_string`` value which is always a
multiple of 4.
2359 ``augmentation_string`` (sequence of UTF-8 characters)
2361 Contains the following UTF-8 string null padded to a multiple of 4 bytes:
2367 The "vX.Y" specifies the major X and minor Y version number of the AMDGPU
extensions used in the DWARF of this index. The version number conforms to
[SEMVER]_.
This differs from the DWARF Version 5 definition, which requires the first
4 characters to be the vendor ID. However, it is consistent with the other
augmentation strings and allows multiple vendor contributions. Backwards
compatibility may nevertheless be more desirable.
2378 Lookup By Address Section Header
2379 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2381 See DWARF Version 5 section 6.1.2.
For AMDGPU the lookup by address section header fields have the following
values:
2385 ``address_size`` (ubyte)
Matches the address size for the ``Global`` address space defined in
2388 :ref:`amdgpu-dwarf-address-space-identifier`.
2390 ``segment_selector_size`` (ubyte)
2392 AMDGPU does not use a segment selector so this is 0. The entries in the
2393 ``.debug_aranges`` do not have a segment selector.
2395 Line Number Information
2396 -----------------------
2398 See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`.
AMDGPU does not use the ``isa`` state machine register and always sets it to 0.
2401 The instruction set must be obtained from the ELF file header ``e_flags`` field
2402 in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header
2403 <amdgpu-elf-header>`). See DWARF Version 5 section 6.2.2.
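
As a rough sketch of how a consumer might obtain the instruction set from the
ELF header, the Python function below reads ``e_flags`` from a 64-bit
little-endian ELF header and masks out the ``EF_AMDGPU_MACH`` field. The mask
value ``0x0ff`` and the ``e_flags`` offset of 48 bytes are assumptions taken
from the ELF header layout rather than from this section, and the function name
is hypothetical. Check both against :ref:`ELF Header <amdgpu-elf-header>`
before relying on them.

```python
import struct

EF_AMDGPU_MACH = 0x0FF      # assumed mask of the processor selection field
E_FLAGS_OFFSET_ELF64 = 48   # assumed offset of e_flags in an ELF64 header


def amdgpu_mach_from_elf_header(header: bytes) -> int:
    """Return the EF_AMDGPU_MACH value stored in e_flags."""
    if len(header) < E_FLAGS_OFFSET_ELF64 + 4 or header[:4] != b"\x7fELF":
        raise ValueError("not a complete ELF header")
    # e_ident[4] is the file class (2 = 64-bit) and e_ident[5] is the data
    # encoding (1 = little-endian).
    if header[4] != 2 or header[5] != 1:
        raise ValueError("expected a 64-bit little-endian ELF file")
    (e_flags,) = struct.unpack_from("<I", header, E_FLAGS_OFFSET_ELF64)
    return e_flags & EF_AMDGPU_MACH
```

The returned value identifies the processor. It is not carried by the ``isa``
register of the line number state machine.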
2407 Should the ``isa`` state machine register be used to indicate if the code is
2408 in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA?
2410 For AMDGPU the line number program header fields have the following values (see
2411 DWARF Version 5 section 6.2.4):
2413 ``address_size`` (ubyte)
2414 Matches the address size for the ``Global`` address space defined in
2415 :ref:`amdgpu-dwarf-address-space-identifier`.
2417 ``segment_selector_size`` (ubyte)
2418 AMDGPU does not use a segment selector so this is 0.
2420 ``minimum_instruction_length`` (ubyte)
2421 For GFX9-GFX10 this is 4.
2423 ``maximum_operations_per_instruction`` (ubyte)
2424 For GFX9-GFX10 this is 1.
2426 Source text for online-compiled programs (for example, those compiled by the
2427 OpenCL language runtime) may be embedded into the DWARF Version 5 line table.
2428 See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For
2429 Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source
2430 <amdgpu-dwarf-line-number-information-dw-lnct-llvm-source>`.
2432 The Clang option used to control source embedding in AMDGPU is defined in
2433 :ref:`amdgpu-clang-debug-options-table`.

.. table:: AMDGPU Clang Debug Options
   :name: amdgpu-clang-debug-options-table

   ==================== ==================================================
   Debug Flag           Description
   ==================== ==================================================
   -g[no-]embed-source  Enable/disable embedding source text in DWARF
                        debug sections. Useful for environments where
                        source cannot be written to disk, such as
                        when performing online compilation.
   ==================== ==================================================

``-gembed-source``
  Enable the embedded source.

``-gno-embed-source``
  Disable the embedded source.
2455 32-Bit and 64-Bit DWARF Formats
2456 -------------------------------
2458 See DWARF Version 5 section 7.4 and
2459 :ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`.
* For the ``amdgcn`` target architecture only the 64-bit process address space
  is supported.
2466 * The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates
2467 the 32-bit DWARF format.
2472 For AMDGPU the following values apply for each of the unit headers described in
2473 DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3:
2475 ``address_size`` (ubyte)
2476 Matches the address size for the ``Global`` address space defined in
2477 :ref:`amdgpu-dwarf-address-space-identifier`.
2479 .. _amdgpu-code-conventions:
2484 This section provides code conventions used for each supported target triple OS
2485 (see :ref:`amdgpu-target-triples`).
2490 This section provides code conventions used when the target triple OS is
2491 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
2493 .. _amdgpu-amdhsa-code-object-metadata:
2495 Code Object Metadata
2496 ~~~~~~~~~~~~~~~~~~~~
2498 The code object metadata specifies extensible metadata associated with the code
2499 objects executed on HSA [HSA]_ compatible runtimes (see :ref:`amdgpu-os`). The
2500 encoding and semantics of this metadata depends on the code object version; see
2501 :ref:`amdgpu-amdhsa-code-object-metadata-v2`,
2502 :ref:`amdgpu-amdhsa-code-object-metadata-v3`, and
2503 :ref:`amdgpu-amdhsa-code-object-metadata-v4`.
2505 Code object metadata is specified in a note record (see
2506 :ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the HSA compatible runtime kernel queries, for
example, the segment sizes needed in a dispatch packet. In addition, a
high-level language runtime may require other information to be included; for
example, the AMD OpenCL runtime records kernel argument information.
2513 .. _amdgpu-amdhsa-code-object-metadata-v2:
2515 Code Object V2 Metadata
2516 +++++++++++++++++++++++
2519 Code object V2 is not the default code object version emitted by this version
2522 Code object V2 metadata is specified by the ``NT_AMD_HSA_METADATA`` note record
2523 (see :ref:`amdgpu-note-records-v2`).
2525 The metadata is specified as a YAML formatted string (see [YAML]_ and
2530 Is the string null terminated? It probably should not if YAML allows it to
2531 contain null characters, otherwise it should be.
2533 The metadata is represented as a single YAML document comprised of the mapping
2534 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-v2-table` and
2537 For boolean values, the string values of ``false`` and ``true`` are used for
2538 false and true respectively.
2540 Additional information can be added to the mappings. To avoid conflicts, any
2541 non-AMD key names should be prefixed by "*vendor-name*.".
2543 .. table:: AMDHSA Code Object V2 Metadata Map
2544 :name: amdgpu-amdhsa-code-object-metadata-map-v2-table
2546 ========== ============== ========= =======================================
2547 String Key Value Type Required? Description
2548 ========== ============== ========= =======================================
2549 "Version" sequence of Required - The first integer is the major
2550 2 integers version. Currently 1.
2551 - The second integer is the minor
2552 version. Currently 0.
2553 "Printf" sequence of Each string is encoded information
2554 strings about a printf function call. The
2555 encoded information is organized as
2556 fields separated by colon (':'):
2558 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
2563 A 32-bit integer as a unique id for
2564 each printf function call
2567 A 32-bit integer equal to the number
2568 of arguments of printf function call
2571 ``S[i]`` (where i = 0, 1, ... , N-1)
2572 32-bit integers for the size in bytes
2573 of the i-th FormatString argument of
2574 the printf function call
2577 The format string passed to the
2578 printf function call.
2579 "Kernels" sequence of Required Sequence of the mappings for each
2580 mapping kernel in the code object. See
2581 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table`
2582 for the definition of the mapping.
2583 ========== ============== ========= =======================================
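
As an illustration of the ``"Printf"`` string layout described in the table
above, the following Python sketch splits one entry into its fields. The
function name and the returned dictionary keys are hypothetical. The sketch
assumes that the format string is everything after the ``N`` size fields, so
any colons inside the format string are preserved.

```python
def parse_printf_metadata(entry):
    """Parse one ``ID:N:S[0]:...:S[N-1]:FormatString`` entry."""
    printf_id, count, rest = entry.split(":", 2)
    n = int(count)
    # Split off exactly N size fields; the remainder is the format string,
    # which may itself contain ':' characters.
    fields = rest.split(":", n)
    if len(fields) != n + 1:
        raise ValueError("entry has fewer size fields than N")
    return {
        "id": int(printf_id),
        "arg_sizes": [int(size) for size in fields[:n]],
        "format": fields[n],
    }
```

For example, ``parse_printf_metadata("1:2:4:8:x=%d: y=%ld")`` yields the ID 1,
the argument sizes 4 and 8, and the format string ``x=%d: y=%ld``.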
2587 .. table:: AMDHSA Code Object V2 Kernel Metadata Map
2588 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-v2-table
2590 ================= ============== ========= ================================
2591 String Key Value Type Required? Description
2592 ================= ============== ========= ================================
2593 "Name" string Required Source name of the kernel.
2594 "SymbolName" string Required Name of the kernel
2595 descriptor ELF symbol.
2596 "Language" string Source language of the kernel.
2604 "LanguageVersion" sequence of - The first integer is the major
2606 - The second integer is the
2608 "Attrs" mapping Mapping of kernel attributes.
2610 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table`
2611 for the mapping definition.
2612 "Args" sequence of Sequence of mappings of the
2613 mapping kernel arguments. See
2614 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table`
2615 for the definition of the mapping.
2616 "CodeProps" mapping Mapping of properties related to
2617 the kernel code. See
2618 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table`
2619 for the mapping definition.
2620 ================= ============== ========= ================================
2624 .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
2625 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-v2-table
2627 =================== ============== ========= ==============================
2628 String Key Value Type Required? Description
2629 =================== ============== ========= ==============================
2630 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
2631 3 integers must be >=1 and the dispatch
2632 work-group size X, Y, Z must
2633 correspond to the specified
2634 values. Defaults to 0, 0, 0.
2636 Corresponds to the OpenCL
2637 ``reqd_work_group_size``
2639 "WorkGroupSizeHint" sequence of The dispatch work-group size
2640 3 integers X, Y, Z is likely to be the
2643 Corresponds to the OpenCL
2644 ``work_group_size_hint``
2646 "VecTypeHint" string The name of a scalar or vector
2649 Corresponds to the OpenCL
2650 ``vec_type_hint`` attribute.
2652 "RuntimeHandle" string The external symbol name
2653 associated with a kernel.
2654 OpenCL runtime allocates a
2655 global buffer for the symbol
2656 and saves the kernel's address
2657 to it, which is used for
2658 device side enqueueing. Only
2659 available for device side
2661 =================== ============== ========= ==============================
2665 .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
2666 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-v2-table
2668 ================= ============== ========= ================================
2669 String Key Value Type Required? Description
2670 ================= ============== ========= ================================
2671 "Name" string Kernel argument name.
2672 "TypeName" string Kernel argument type name.
2673 "Size" integer Required Kernel argument size in bytes.
2674 "Align" integer Required Kernel argument alignment in
2675 bytes. Must be a power of two.
2676 "ValueKind" string Required Kernel argument kind that
2677 specifies how to set up the
2678 corresponding argument.
2682 The argument is copied
2683 directly into the kernarg.
2686 A global address space pointer
2687 to the buffer data is passed
2690 "DynamicSharedPointer"
2691 A group address space pointer
2692 to dynamically allocated LDS
2693 is passed in the kernarg.
2696 A global address space
2697 pointer to a S# is passed in
2701 A global address space
2702 pointer to a T# is passed in
2706 A global address space pointer
2707 to an OpenCL pipe is passed in
2711 A global address space pointer
2712 to an OpenCL device enqueue
2713 queue is passed in the
2716 "HiddenGlobalOffsetX"
2717 The OpenCL grid dispatch
2718 global offset for the X
2719 dimension is passed in the
2722 "HiddenGlobalOffsetY"
2723 The OpenCL grid dispatch
2724 global offset for the Y
2725 dimension is passed in the
2728 "HiddenGlobalOffsetZ"
2729 The OpenCL grid dispatch
2730 global offset for the Z
2731 dimension is passed in the
2735 An argument that is not used
2736 by the kernel. Space needs to
2737 be left for it, but it does
2738 not need to be set up.
2740 "HiddenPrintfBuffer"
2741 A global address space pointer
2742 to the runtime printf buffer
2743 is passed in kernarg.
2745 "HiddenHostcallBuffer"
2746 A global address space pointer
2747 to the runtime hostcall buffer
2748 is passed in kernarg.
2750 "HiddenDefaultQueue"
2751 A global address space pointer
2752 to the OpenCL device enqueue
2753 queue that should be used by
2754 the kernel by default is
2755 passed in the kernarg.
2757 "HiddenCompletionAction"
2758 A global address space pointer
2759 to help link enqueued kernels into
2760 the ancestor tree for determining
2761 when the parent kernel has finished.
2763 "HiddenMultiGridSyncArg"
2764 A global address space pointer for
2765 multi-grid synchronization is
2766 passed in the kernarg.
2768 "ValueType" string Unused and deprecated. This should no longer
2769 be emitted, but is accepted for compatibility.
2772 "PointeeAlign" integer Alignment in bytes of pointee
2773 type for pointer type kernel
2774 argument. Must be a power
2775 of 2. Only present if
2777 "DynamicSharedPointer".
2778 "AddrSpaceQual" string Kernel argument address space
2779 qualifier. Only present if
2780 "ValueKind" is "GlobalBuffer" or
2781 "DynamicSharedPointer". Values
2793 Is GlobalBuffer only Global
2795 DynamicSharedPointer always
2796 Local? Can HCC allow Generic?
2797 How can Private or Region
2800 "AccQual" string Kernel argument access
2801 qualifier. Only present if
2802 "ValueKind" is "Image" or
2815 "ActualAccQual" string The actual memory accesses
2816 performed by the kernel on the
2817 kernel argument. Only present if
2818 "ValueKind" is "GlobalBuffer",
2819 "Image", or "Pipe". This may be
2820 more restrictive than indicated
2821 by "AccQual" to reflect what the
2822 kernel actually does. If not
2823 present then the runtime must
2824 assume what is implied by
2825 "AccQual" and "IsConst". Values
2832 "IsConst" boolean Indicates if the kernel argument
2833 is const qualified. Only present
2837 "IsRestrict" boolean Indicates if the kernel argument
2838 is restrict qualified. Only
2839 present if "ValueKind" is
2842 "IsVolatile" boolean Indicates if the kernel argument
2843 is volatile qualified. Only
2844 present if "ValueKind" is
2847 "IsPipe" boolean Indicates if the kernel argument
2848 is pipe qualified. Only present
2849 if "ValueKind" is "Pipe".
2853 Can GlobalBuffer be pipe
2856 ================= ============== ========= ================================
2860 .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
2861 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-v2-table
2863 ============================ ============== ========= =====================
2864 String Key Value Type Required? Description
2865 ============================ ============== ========= =====================
2866 "KernargSegmentSize" integer Required The size in bytes of
2868 that holds the values
2871 "GroupSegmentFixedSize" integer Required The amount of group
2875 bytes. This does not
2877 dynamically allocated
2878 group segment memory
2882 "PrivateSegmentFixedSize" integer Required The amount of fixed
2883 private address space
2884 memory required for a
2886 bytes. If the kernel
2888 stack then additional
2890 to this value for the
2892 "KernargSegmentAlign" integer Required The maximum byte
2895 kernarg segment. Must
2897 "WavefrontSize" integer Required Wavefront size. Must
2899 "NumSGPRs" integer Required Number of scalar
2903 includes the special
2905 Scratch (GFX7-GFX10)
2907 GFX8-GFX10). It does
2909 SGPR added if a trap
2915 "NumVGPRs" integer Required Number of vector
2919 "MaxFlatWorkGroupSize" integer Required Maximum flat
2922 kernel in work-items.
2925 ReqdWorkGroupSize if
2927 "NumSpilledSGPRs" integer Number of stores from
2928 a scalar register to
2929 a register allocator
2932 "NumSpilledVGPRs" integer Number of stores from
2933 a vector register to
2934 a register allocator
2937 ============================ ============== ========= =====================
2939 .. _amdgpu-amdhsa-code-object-metadata-v3:
2941 Code Object V3 Metadata
2942 +++++++++++++++++++++++
2944 Code object V3 to V4 metadata is specified by the ``NT_AMDGPU_METADATA`` note
2945 record (see :ref:`amdgpu-note-records-v3-v4`).
2947 The metadata is represented as Message Pack formatted binary data (see
2948 [MsgPack]_). The top level is a Message Pack map that includes the
2949 keys defined in table
2950 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
2953 Additional information can be added to the maps. To avoid conflicts,
2954 any key names should be prefixed by "*vendor-name*." where
2955 ``vendor-name`` can be the name of the vendor and specific vendor
2956 tool that generates the information. The prefix is abbreviated to
2957 simply "." when it appears within a map that has been added by the
2960 .. table:: AMDHSA Code Object V3 Metadata Map
2961 :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
2963 ================= ============== ========= =======================================
2964 String Key Value Type Required? Description
2965 ================= ============== ========= =======================================
2966 "amdhsa.version" sequence of Required - The first integer is the major
2967 2 integers version. Currently 1.
2968 - The second integer is the minor
2969 version. Currently 0.
2970 "amdhsa.printf" sequence of Each string is encoded information
2971 strings about a printf function call. The
2972 encoded information is organized as
2973 fields separated by colon (':'):
2975 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
2980 A 32-bit integer as a unique id for
2981 each printf function call
2984 A 32-bit integer equal to the number
2985 of arguments of the printf function call
2988 ``S[i]`` (where i = 0, 1, ... , N-1)
2989 32-bit integers for the size in bytes
2990 of the i-th FormatString argument of
2991 the printf function call
2994 The format string passed to the
2995 printf function call.
2996 "amdhsa.kernels" sequence of Required Sequence of the maps for each
2997 map kernel in the code object. See
2998 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
2999 for the definition of the keys included
3001 ================= ============== ========= =======================================
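The ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` layout of an "amdhsa.printf"
entry can be decoded roughly as follows. This is a minimal sketch, not the
decoder used by any runtime: the ``PrintfInfo`` type and the
``parse_printf_entry`` helper are hypothetical names, and the sketch assumes
that the format string is stored verbatim, so any further colons belong to it.

```python
from typing import List, NamedTuple


class PrintfInfo(NamedTuple):
    """Decoded fields of one "amdhsa.printf" string (hypothetical type)."""
    printf_id: int
    arg_sizes: List[int]
    format_string: str


def parse_printf_entry(entry: str) -> PrintfInfo:
    """Split an ID:N:S[0]:...:S[N-1]:FormatString entry into its fields."""
    # ID and N are the first two colon-separated fields.
    id_text, count_text, rest = entry.split(":", 2)
    arg_count = int(count_text)
    # N size fields follow. Everything after them is the format string,
    # which may itself contain ':' characters, so the split is capped at N.
    pieces = rest.split(":", arg_count)
    if len(pieces) != arg_count + 1:
        raise ValueError(f"expected {arg_count} sizes and a format string")
    sizes = [int(size) for size in pieces[:arg_count]]
    return PrintfInfo(int(id_text), sizes, pieces[arg_count])


info = parse_printf_entry("1:2:4:8:x=%d: y=%ld")
print(info.printf_id, info.arg_sizes, repr(info.format_string))
```

Capping the second split at ``N`` keeps a format string such as
``x=%d: y=%ld`` intact even though it contains a colon.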
3005 .. table:: AMDHSA Code Object V3 Kernel Metadata Map
3006 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
3008 =================================== ============== ========= ================================
3009 String Key Value Type Required? Description
3010 =================================== ============== ========= ================================
3011 ".name" string Required Source name of the kernel.
3012 ".symbol" string Required Name of the kernel
3013 descriptor ELF symbol.
3014 ".language" string Source language of the kernel.
3024 ".language_version" sequence of - The first integer is the major
3026 - The second integer is the
3028 ".args" sequence of Sequence of maps of the
3029 map kernel arguments. See
3030 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
3031 for the definition of the keys
3032 included in that map.
3033 ".reqd_workgroup_size" sequence of If not 0, 0, 0 then all values
3034 3 integers must be >=1 and the dispatch
3035 work-group size X, Y, Z must
3036 correspond to the specified
3037 values. Defaults to 0, 0, 0.
3039 Corresponds to the OpenCL
3040 ``reqd_work_group_size``
3042 ".workgroup_size_hint" sequence of The dispatch work-group size
3043 3 integers X, Y, Z is likely to be the
3046 Corresponds to the OpenCL
3047 ``work_group_size_hint``
3049 ".vec_type_hint" string The name of a scalar or vector
3052 Corresponds to the OpenCL
3053 ``vec_type_hint`` attribute.
3055 ".device_enqueue_symbol" string The external symbol name
3056 associated with a kernel.
3057 OpenCL runtime allocates a
3058 global buffer for the symbol
3059 and saves the kernel's address
3060 to it, which is used for
3061 device side enqueueing. Only
3062 available for device side
3064 ".kernarg_segment_size" integer Required The size in bytes of
3066 that holds the values
3069 ".group_segment_fixed_size" integer Required The amount of group
3073 bytes. This does not
3075 dynamically allocated
3076 group segment memory
3080 ".private_segment_fixed_size" integer Required The amount of fixed
3081 private address space
3082 memory required for a
3084 bytes. If the kernel
3086 stack then additional
3088 to this value for the
3090 ".kernarg_segment_align" integer Required The maximum byte
3093 kernarg segment. Must
3095 ".wavefront_size" integer Required Wavefront size. Must
3097 ".sgpr_count" integer Required Number of scalar
3098 registers required by a
3100 GFX6-GFX9. A register
3101 is required if it is
3103 if a higher numbered
3106 includes the special
3112 SGPR added if a trap
3118 ".vgpr_count" integer Required Number of vector
3119 registers required by
3121 GFX6-GFX9. A register
3122 is required if it is
3124 if a higher numbered
3127 ".max_flat_workgroup_size" integer Required Maximum flat
3130 kernel in work-items.
3133 ReqdWorkGroupSize if
3135 ".sgpr_spill_count" integer Number of stores from
3136 a scalar register to
3137 a register allocator
3140 ".vgpr_spill_count" integer Number of stores from
3141 a vector register to
3142 a register allocator
3145 ".kind" string The kind of the kernel
3153 These kernels must be
3154 invoked after loading
3164 These kernels must be
3167 containing code object
3168 and after all init and
3169 normal kernels in the
3170 same code object have
3174 If omitted, "normal" is
3176 =================================== ============== ========= ================================
3180 .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
3181 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
3183 ====================== ============== ========= ================================
3184 String Key Value Type Required? Description
3185 ====================== ============== ========= ================================
3186 ".name" string Kernel argument name.
3187 ".type_name" string Kernel argument type name.
3188 ".size" integer Required Kernel argument size in bytes.
3189 ".offset" integer Required Kernel argument offset in
3190 bytes. The offset must be a
3191 multiple of the alignment
3192 required by the argument.
3193 ".value_kind" string Required Kernel argument kind that
3194 specifies how to set up the
3195 corresponding argument.
3199 The argument is copied
3200 directly into the kernarg.
3203 A global address space pointer
3204 to the buffer data is passed
3207 "dynamic_shared_pointer"
3208 A group address space pointer
3209 to dynamically allocated LDS
3210 is passed in the kernarg.
3213 A global address space
3214 pointer to a S# is passed in
3218 A global address space
3219 pointer to a T# is passed in
3223 A global address space pointer
3224 to an OpenCL pipe is passed in
3228 A global address space pointer
3229 to an OpenCL device enqueue
3230 queue is passed in the
3233 "hidden_global_offset_x"
3234 The OpenCL grid dispatch
3235 global offset for the X
3236 dimension is passed in the
3239 "hidden_global_offset_y"
3240 The OpenCL grid dispatch
3241 global offset for the Y
3242 dimension is passed in the
3245 "hidden_global_offset_z"
3246 The OpenCL grid dispatch
3247 global offset for the Z
3248 dimension is passed in the
3252 An argument that is not used
3253 by the kernel. Space needs to
3254 be left for it, but it does
3255 not need to be set up.
3257 "hidden_printf_buffer"
3258 A global address space pointer
3259 to the runtime printf buffer
3260 is passed in kernarg.
3262 "hidden_hostcall_buffer"
3263 A global address space pointer
3264 to the runtime hostcall buffer
3265 is passed in kernarg.
3267 "hidden_default_queue"
3268 A global address space pointer
3269 to the OpenCL device enqueue
3270 queue that should be used by
3271 the kernel by default is
3272 passed in the kernarg.
3274 "hidden_completion_action"
3275 A global address space pointer
3276 to help link enqueued kernels into
3277 the ancestor tree for determining
3278 when the parent kernel has finished.
3280 "hidden_multigrid_sync_arg"
3281 A global address space pointer for
3282 multi-grid synchronization is
3283 passed in the kernarg.
3285 ".value_type" string Unused and deprecated. This should no longer
3286 be emitted, but is accepted for compatibility.
3288 ".pointee_align" integer Alignment in bytes of pointee
3289 type for pointer type kernel
3290 argument. Must be a power
3291 of 2. Only present if
3293 "dynamic_shared_pointer".
3294 ".address_space" string Kernel argument address space
3295 qualifier. Only present if
3296 ".value_kind" is "global_buffer" or
3297 "dynamic_shared_pointer". Values
3309 Is "global_buffer" only "global"
3311 "dynamic_shared_pointer" always
3312 "local"? Can HCC allow "generic"?
3313 How can "private" or "region"
3316 ".access" string Kernel argument access
3317 qualifier. Only present if
3318 ".value_kind" is "image" or
3331 ".actual_access" string The actual memory accesses
3332 performed by the kernel on the
3333 kernel argument. Only present if
3334 ".value_kind" is "global_buffer",
3335 "image", or "pipe". This may be
3336 more restrictive than indicated
3337 by ".access" to reflect what the
3338 kernel actually does. If not
3339 present then the runtime must
3340 assume what is implied by
3341 ".access" and ".is_const". Values
3348 ".is_const" boolean Indicates if the kernel argument
3349 is const qualified. Only present
3353 ".is_restrict" boolean Indicates if the kernel argument
3354 is restrict qualified. Only
3355 present if ".value_kind" is
3358 ".is_volatile" boolean Indicates if the kernel argument
3359 is volatile qualified. Only
3360 present if ".value_kind" is
3363 ".is_pipe" boolean Indicates if the kernel argument
3364 is pipe qualified. Only present
3365 if ".value_kind" is "pipe".
3369 Can "global_buffer" be pipe
3372 ====================== ============== ========= ================================
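The *Required* column of the V3 tables can be turned into a simple consistency
check. The sketch below assumes the Message Pack data has already been decoded
into plain Python dictionaries and lists (the decoding step is not shown), and
``missing_required_keys`` is a hypothetical helper, not part of any LLVM or
runtime API. The key names are the ones marked Required in the tables above.

```python
REQUIRED_TOP = ("amdhsa.version", "amdhsa.kernels")
REQUIRED_KERNEL = (
    ".name", ".symbol", ".kernarg_segment_size", ".group_segment_fixed_size",
    ".private_segment_fixed_size", ".kernarg_segment_align", ".wavefront_size",
    ".sgpr_count", ".vgpr_count", ".max_flat_workgroup_size",
)
REQUIRED_ARG = (".size", ".offset", ".value_kind")


def missing_required_keys(metadata: dict) -> list:
    """Return a list of human-readable paths to missing Required keys."""
    problems = [key for key in REQUIRED_TOP if key not in metadata]
    for k_index, kernel in enumerate(metadata.get("amdhsa.kernels", [])):
        for key in REQUIRED_KERNEL:
            if key not in kernel:
                problems.append(f"amdhsa.kernels[{k_index}]{key}")
        # ".args" is optional, but each argument map has Required keys.
        for a_index, arg in enumerate(kernel.get(".args", [])):
            for key in REQUIRED_ARG:
                if key not in arg:
                    problems.append(
                        f"amdhsa.kernels[{k_index}].args[{a_index}]{key}")
    return problems


# Illustrative values only; they do not describe a real kernel.
example = {
    "amdhsa.version": [1, 0],
    "amdhsa.kernels": [{
        ".name": "my_kernel",
        ".symbol": "my_kernel.kd",
        ".kernarg_segment_size": 8,
        ".group_segment_fixed_size": 0,
        ".private_segment_fixed_size": 0,
        ".kernarg_segment_align": 8,
        ".wavefront_size": 64,
        ".sgpr_count": 16,
        ".vgpr_count": 4,
        ".max_flat_workgroup_size": 256,
        ".args": [{".size": 8, ".value_kind": "global_buffer"}],
    }],
}
print(missing_required_keys(example))
```

In this example the only reported problem is the argument's missing
".offset" key.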
3374 .. _amdgpu-amdhsa-code-object-metadata-v4:
3376 Code Object V4 Metadata
3377 +++++++++++++++++++++++
3380 Code object V4 is not the default code object version emitted by this version
3383 Code object V4 metadata is the same as
3384 :ref:`amdgpu-amdhsa-code-object-metadata-v3` with the changes and additions
3385 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v4`.
3387 .. table:: AMDHSA Code Object V4 Metadata Map Changes from :ref:`amdgpu-amdhsa-code-object-metadata-v3`
3388 :name: amdgpu-amdhsa-code-object-metadata-map-table-v4
3390 ================= ============== ========= =======================================
3391 String Key Value Type Required? Description
3392 ================= ============== ========= =======================================
3393 "amdhsa.version" sequence of Required - The first integer is the major
3394 2 integers version. Currently 1.
3395 - The second integer is the minor
3396 version. Currently 1.
3397 "amdhsa.target" string Required The target name of the code using the syntax:
3401 <target-triple> [ "-" <target-id> ]
3403 A canonical target ID must be
3404 used. See :ref:`amdgpu-target-triples`
3405 and :ref:`amdgpu-target-id`.
3406 ================= ============== ========= =======================================
3413 The HSA architected queuing language (AQL) defines a user space memory interface
3414 that can be used to control the dispatch of kernels, in an agent independent
3415 way. An agent can have zero or more AQL queues created for it using an HSA
3416 compatible runtime (see :ref:`amdgpu-os`), in which AQL packets (all of which
3417 are 64 bytes) can be placed. See the *HSA Platform System Architecture
3418 Specification* [HSA]_ for the AQL queue mechanics and packet layouts.
3420 The packet processor of a kernel agent is responsible for detecting and
3421 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
3422 packet processor is implemented by the hardware command processor (CP),
3423 asynchronous dispatch controller (ADC) and shader processor input controller
3426 An HSA compatible runtime can be used to allocate an AQL queue object. It uses
3427 the kernel mode driver to initialize and register the AQL queue with CP.
3429 To dispatch a kernel the following actions are performed. This can occur in the
3430 CPU host program, or from an HSA kernel executing on a GPU.
3432 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
3433 executed is obtained.
3434 2. A pointer to the kernel descriptor (see
3435 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained.
3436 It must be for a kernel that is contained in a code object that was
3437 loaded by an HSA compatible runtime on the kernel agent with which the AQL
3438 queue is associated.
3439 3. Space is allocated for the kernel arguments using the HSA compatible runtime
3440 allocator for a memory region with the kernarg property for the kernel agent
3441 that will execute the kernel. It must be at least 16-byte aligned.
3442 4. Kernel argument values are assigned to the kernel argument memory
3443 allocation. The layout is defined in the *HSA Programmer's Language
3444 Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
3445 kernel argument memory in the same way constant memory is accessed. (Note
3446 that the HSA specification allows an implementation to copy the kernel
3447 argument contents to another location that is accessed by the kernel.)
3448 5. An AQL kernel dispatch packet is created on the AQL queue. The HSA compatible
3449 runtime API uses 64-bit atomic operations to reserve space in the AQL queue
3450 for the packet. The packet must be set up, and the final write must use an
3451 atomic store release to set the packet kind to ensure the packet contents are
3452 visible to the kernel agent. AQL defines a doorbell signal mechanism to
3453 notify the kernel agent that the AQL queue has been updated. These rules, and
3454 the layout of the AQL queue and kernel dispatch packet are defined in the *HSA
3455 System Architecture Specification* [HSA]_.
3456 6. A kernel dispatch packet includes information about the actual dispatch,
3457 such as grid and work-group size, together with information from the code
3458 object about the kernel, such as segment sizes. The HSA compatible runtime
3459 queries on the kernel symbol can be used to obtain the code object values
3460 which are recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
3461 7. CP executes micro-code and is responsible for detecting and setting up the
3462 GPU to execute the wavefronts of a kernel dispatch.
3463 8. CP ensures that when a wavefront starts executing the kernel machine
3464 code, the scalar general purpose registers (SGPR) and vector general purpose
3465 registers (VGPR) are set up as required by the machine code. The required
3466 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
3467 register state is defined in
3468 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
3469 9. The prolog of the kernel machine code (see
3470 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
3471 before continuing to execute the machine code that corresponds to the kernel.
3472 10. When the kernel dispatch has completed execution, CP signals the completion
3473 signal specified in the kernel dispatch packet if not 0.
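The ordering requirement in step 5 (reserve a slot, fill in the packet, write
the packet kind last, then ring the doorbell) can be illustrated with a toy
model. This is only a sketch of the ordering, not the HSA runtime API:
``ToyAqlQueue``, its field names, and the string packet kinds are all
hypothetical, a ``threading.Lock`` stands in for the 64-bit atomic operations,
and a plain assignment stands in for the atomic store release. Queue-full
handling is omitted.

```python
import threading


class ToyAqlQueue:
    """Toy ring buffer that mimics the publication order of AQL packets."""

    def __init__(self, size: int):
        # Every slot starts out with an invalid packet kind.
        self.slots = [{"header": "invalid"} for _ in range(size)]
        self.write_index = 0
        self.doorbell = -1              # last index signalled to the agent
        self._lock = threading.Lock()   # stand-in for 64-bit atomics

    def reserve(self) -> int:
        """Reserve one packet slot by bumping the write index."""
        with self._lock:
            index = self.write_index
            self.write_index += 1
        return index

    def publish(self, index: int, body: dict) -> None:
        """Fill the packet body, then make it visible to the agent."""
        slot = self.slots[index % len(self.slots)]
        slot.update(body)                  # everything except the header
        # Final write: a real producer uses an atomic store release here so
        # the body is visible before the packet kind changes.
        slot["header"] = "kernel_dispatch"
        self.doorbell = index              # notify the kernel agent


queue = ToyAqlQueue(size=4)
slot_index = queue.reserve()
queue.publish(slot_index, {
    "grid_size": (1024, 1, 1),
    "workgroup_size": (256, 1, 1),
    "kernel_object": 0x1000,       # address of the kernel descriptor
    "kernarg_address": 0x2000,     # 16-byte aligned kernarg allocation
    "completion_signal": 0,        # 0 means no completion signal
})
print(queue.slots[slot_index]["header"], queue.doorbell)
```

Writing the header last is what prevents the packet processor from consuming
a half-written packet.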
3475 .. _amdgpu-amdhsa-memory-spaces:
3480 The memory space properties are:
3482 .. table:: AMDHSA Memory Spaces
3483 :name: amdgpu-amdhsa-memory-spaces-table
3485 ================= =========== ======== ======= ==================
3486 Memory Space Name HSA Segment Hardware Address NULL Value
3488 ================= =========== ======== ======= ==================
3489 Private private scratch 32 0x00000000
3490 Local group LDS 32 0xFFFFFFFF
3491 Global global global 64 0x0000000000000000
3492 Constant constant *same as 64 0x0000000000000000
3494 Generic flat flat 64 0x0000000000000000
3495 Region N/A GDS 32 *not implemented
3497 ================= =========== ======== ======= ==================
3499 The global and constant memory spaces both use global virtual addresses, which
3500 are the same virtual address space used by the CPU. However, some virtual
3501 addresses may only be accessible to the CPU, some only accessible by the GPU,
3504 Using the constant memory space indicates that the data will not change during
3505 the execution of the kernel. This allows scalar read instructions to be
3506 used. The vector and scalar L1 caches are invalidated of volatile data before
3507 each kernel dispatch execution to allow constant memory to change values between
3510 The local memory space uses the hardware Local Data Store (LDS) which is
3511 automatically allocated when the hardware creates work-groups of wavefronts, and
3512 freed when all the wavefronts of a work-group have terminated. The data store
3513 (DS) instructions can be used to access it.
3515 The private memory space uses the hardware scratch memory support. If the kernel
3516 uses scratch, then the hardware allocates memory that is accessed using
3517 wavefront lane dword (4 byte) interleaving. The mapping used from private
3518 address to physical address is:
3520 ``wavefront-scratch-base +
3521 (private-address * wavefront-size * 4) +
3522 (wavefront-lane-id * 4)``
3524 There are different ways that the wavefront scratch base address is determined
3525 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
3526 memory can be accessed in an interleaved manner using buffer instructions with
3527 the scratch buffer descriptor and per wavefront scratch offset, by the scratch
3528 instructions, or by flat instructions. If each lane of a wavefront accesses the
3529 same private address, the interleaving results in adjacent dwords being accessed
3530 and hence requires fewer cache lines to be fetched. Multi-dword access is not
3531 supported except by flat and scratch instructions in GFX9-GFX10.
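The private-address mapping given above can be written out as a short
function. This is a sketch that transcribes the formula as stated;
``scratch_physical_address`` is a hypothetical helper name and the base
address used in the example is arbitrary.

```python
def scratch_physical_address(wavefront_scratch_base: int,
                             private_address: int,
                             wavefront_lane_id: int,
                             wavefront_size: int = 64) -> int:
    """Apply the dword-interleaved private-to-physical mapping."""
    return (wavefront_scratch_base
            + private_address * wavefront_size * 4
            + wavefront_lane_id * 4)


# Four lanes of one wavefront accessing the same private address.
addresses = [scratch_physical_address(0x1000, 2, lane) for lane in range(4)]
print([hex(address) for address in addresses])
```

Consecutive lanes land on consecutive dwords, which is why a wavefront-wide
access to one private address touches a contiguous run of memory.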
3533 The generic address space uses the hardware flat address support available in
3534 GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
3535 local apertures), that are outside the range of addressable global memory, to
3536 map from a flat address to a private or local address.
3538 FLAT instructions can take a flat address and access global, private (scratch)
3539 and group (LDS) memory depending on whether the address is within one of the
3540 aperture ranges. Flat access to scratch requires hardware aperture setup and
3541 setup in the kernel prologue (see
3542 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires
3543 hardware aperture setup and M0 (GFX7-GFX8) register setup (see
3544 :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
3546 To convert between a segment address and a flat address, the base addresses of
3547 the apertures can be used. For GFX7-GFX8 these are available in the
3548 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
3549 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
3550 GFX9-GFX10 the aperture base addresses are directly available as inline constant
3551 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64-bit
3552 address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
3553 which makes it easier to convert from flat to segment or segment to flat.
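Because each aperture is 2^32 bytes in size and aligned to 2^32, the
conversion reduces to bit operations on the upper and lower 32 bits. The
sketch below illustrates that arithmetic only: the helper names and the
aperture base value are hypothetical, and NULL pointers are not handled (the
memory spaces table gives different NULL values for the local and generic
spaces, so real code needs a special case for them).

```python
APERTURE_SIZE = 1 << 32


def segment_to_flat(aperture_base: int, segment_address: int) -> int:
    """Form a flat address from a 32-bit private or local address."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    # The low 32 bits of an aligned base are zero, so OR is the same as add.
    return aperture_base | segment_address


def in_aperture(aperture_base: int, flat_address: int) -> bool:
    """Check whether a flat address falls inside the given aperture."""
    return (flat_address >> 32) == (aperture_base >> 32)


def flat_to_segment(flat_address: int) -> int:
    """Recover the 32-bit segment address by dropping the upper bits."""
    return flat_address & (APERTURE_SIZE - 1)


shared_base = 0x0001_0000_0000_0000   # hypothetical aperture base
flat = segment_to_flat(shared_base, 0x40)
print(hex(flat), in_aperture(shared_base, flat), hex(flat_to_segment(flat)))
```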
3558 Image and sample handles created by an HSA compatible runtime (see
3559 :ref:`amdgpu-os`) are 64-bit addresses of a hardware 32-byte V# and 48-byte S#
3560 object respectively. In order to support the HSA ``query_sampler`` operations
3561 two extra dwords are used to store the HSA BRIG enumeration values for the
3562 queries that are not trivially deducible from the S# representation.
3567 HSA signal handles created by an HSA compatible runtime (see :ref:`amdgpu-os`)
3568 are 64-bit addresses of a structure allocated in memory accessible from both the
3569 CPU and GPU. The structure is defined by the runtime and subject to change
3570 between releases. For example, see [AMD-ROCm-github]_.
3572 .. _amdgpu-amdhsa-hsa-aql-queue:
3577 The HSA AQL queue structure is defined by an HSA compatible runtime (see
3578 :ref:`amdgpu-os`) and subject to change between releases. For example, see
3579 [AMD-ROCm-github]_. For some processors it contains fields needed to implement
3580 certain language features such as the flat address aperture bases. It also
3581 contains fields used by CP such as managing the allocation of scratch memory.
3583 .. _amdgpu-amdhsa-kernel-descriptor:
3588 A kernel descriptor consists of the information needed by CP to initiate the
3589 execution of a kernel, including the entry point address of the machine code
3590 that implements the kernel.
3592 Code Object V3 Kernel Descriptor
3593 ++++++++++++++++++++++++++++++++
3595 CP microcode requires the kernel descriptor to be allocated on 64-byte
3598 The fields used by CP for code objects before V3 also match those specified in
3599 :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
3601 .. table:: Code Object V3 Kernel Descriptor
3602 :name: amdgpu-amdhsa-kernel-descriptor-v3-table
3604 ======= ======= =============================== ============================
3605 Bits Size Field Name Description
3606 ======= ======= =============================== ============================
3607 31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local
3608 address space memory
3609 required for a work-group
3610 in bytes. This does not
3611 include any dynamically
3612 allocated local address
3613 space memory that may be
3614 added when the kernel is
3616 63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed
3617 private address space
3618 memory required for a
3620 Additional space may need to
3621 be added to this value if
3623 non-inlined function calls.
3624 95:64 4 bytes KERNARG_SIZE The size of the kernarg
3625 memory pointed to by the
3626 AQL dispatch packet. The
3627 kernarg memory is used to
3628 pass arguments to the
3631 * If the kernarg pointer in
3632 the dispatch packet is NULL
3633 then there are no kernel
3635 * If the kernarg pointer in
3636 the dispatch packet is
3637 not NULL and this value
3638 is 0 then the kernarg
3641 * If the kernarg pointer in
3642 the dispatch packet is
3643 not NULL and this value
3644 is not 0 then the value
3645 specifies the kernarg
3646 memory size in bytes. It
3647 is recommended to provide
3648 a value as it may be used
3649 by CP to optimize making
3651 visible to the kernel
3654 127:96 4 bytes Reserved, must be 0.
3655 191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly
3658 descriptor to kernel's
3659 entry point instruction
3660 which must be 256 byte
3662 351:192 20 bytes Reserved, must be 0.
3664 383:352 4 bytes COMPUTE_PGM_RSRC3 GFX6-GFX9
3665 Reserved, must be 0.
3668 program settings used by
3670 ``COMPUTE_PGM_RSRC3``
3673 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
3676 program settings used by
3678 ``COMPUTE_PGM_RSRC3``
3681 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
3682 415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS)
3683 program settings used by
3685 ``COMPUTE_PGM_RSRC1``
3688 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
3689 447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS)
3690 program settings used by
3692 ``COMPUTE_PGM_RSRC2``
3695 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
3696 454:448 7 bits *See separate bits below.* Enable the setup of the
3697 SGPR user data registers
3699 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
3701 The total number of SGPR
3703 requested must not exceed
3704 16 and must match the value in
3705 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
3706 Any requests beyond 16
3708 >448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT If the *Target Properties*
3710 :ref:`amdgpu-processor-table`
3711 specifies *Architected flat
3712 scratch* then not supported
3714 >449 1 bit ENABLE_SGPR_DISPATCH_PTR
3715 >450 1 bit ENABLE_SGPR_QUEUE_PTR
3716 >451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR
3717 >452 1 bit ENABLE_SGPR_DISPATCH_ID
3718 >453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT If the *Target Properties*
3720 :ref:`amdgpu-processor-table`
3721 specifies *Architected flat
3722 scratch* then not supported
3724 >454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT
3726 457:455 3 bits Reserved, must be 0.
3727 458 1 bit ENABLE_WAVEFRONT_SIZE32 GFX6-GFX9
3728 Reserved, must be 0.
3731 wavefront size 64 mode.
3733 native wavefront size
3735 463:459 5 bits Reserved, must be 0.
3736 464 1 bit RESERVED_464 Deprecated, must be 0.
3737 467:465 3 bits Reserved, must be 0.
3738 468 1 bit RESERVED_468 Deprecated, must be 0.
3739 471:469 3 bits Reserved, must be 0.
3740 511:472 5 bytes Reserved, must be 0.
3741 512 **Total size 64 bytes.**
3742 ======= ====================================================================
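The 64-byte layout in the table above can be decoded with a fixed-size
structure. The following is a minimal sketch, not code from LLVM or any
runtime: ``unpack_kernel_descriptor`` is a hypothetical helper, little-endian
byte order is assumed, the 20 reserved bytes are taken to follow the 8-byte
entry offset directly, and only the enable bits whose names appear in full in
the table are decoded.

```python
import struct

# Three u32 sizes, 4 reserved bytes, signed 64-bit entry offset, 20 reserved
# bytes, RSRC3, RSRC1, RSRC2, 16 bits of enable flags, 6 reserved bytes.
_KERNEL_DESCRIPTOR = struct.Struct("<IIIIq20sIIIH6s")
assert _KERNEL_DESCRIPTOR.size == 64

# Bit positions relative to bit 448 of the descriptor.
_FLAG_BITS = {
    "ENABLE_SGPR_DISPATCH_PTR": 1,
    "ENABLE_SGPR_QUEUE_PTR": 2,
    "ENABLE_SGPR_KERNARG_SEGMENT_PTR": 3,
    "ENABLE_SGPR_DISPATCH_ID": 4,
    "ENABLE_SGPR_FLAT_SCRATCH_INIT": 5,
    "ENABLE_WAVEFRONT_SIZE32": 10,
}


def unpack_kernel_descriptor(raw: bytes) -> dict:
    """Decode the named fields of a 64-byte kernel descriptor."""
    (group_size, private_size, kernarg_size, _reserved0, entry_offset,
     _reserved1, rsrc3, rsrc1, rsrc2, flags,
     _reserved2) = _KERNEL_DESCRIPTOR.unpack(raw)
    fields = {
        "GROUP_SEGMENT_FIXED_SIZE": group_size,
        "PRIVATE_SEGMENT_FIXED_SIZE": private_size,
        "KERNARG_SIZE": kernarg_size,
        "KERNEL_CODE_ENTRY_BYTE_OFFSET": entry_offset,
        "COMPUTE_PGM_RSRC3": rsrc3,
        "COMPUTE_PGM_RSRC1": rsrc1,
        "COMPUTE_PGM_RSRC2": rsrc2,
    }
    for name, bit in _FLAG_BITS.items():
        fields[name] = bool((flags >> bit) & 1)
    return fields


# Build an illustrative descriptor and decode it again.
raw = _KERNEL_DESCRIPTOR.pack(256, 16, 24, 0, -4096, bytes(20),
                              0, 0x12, 0x34, (1 << 3) | (1 << 10), bytes(6))
descriptor = unpack_kernel_descriptor(raw)
print(descriptor["KERNEL_CODE_ENTRY_BYTE_OFFSET"],
      descriptor["ENABLE_SGPR_KERNARG_SEGMENT_PTR"])
```

The entry offset is unpacked as a signed value because the table describes
it as possibly negative.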
3746 .. table:: compute_pgm_rsrc1 for GFX6-GFX10
3747 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table
3749 ======= ======= =============================== ===========================================================================
3750 Bits Size Field Name Description
3751 ======= ======= =============================== ===========================================================================
3752 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register
3753 blocks used by each work-item;
3754 granularity is device
3759 - max(0, ceil(vgprs_used / 4) - 1)
3762 - vgprs_used = align(arch_vgprs, 4)
3764 - max(0, ceil(vgprs_used / 8) - 1)
3765 GFX10 (wavefront size 64)
3767 - max(0, ceil(vgprs_used / 4) - 1)
3768 GFX10 (wavefront size 32)
3770 - max(0, ceil(vgprs_used / 8) - 1)
3772 Where vgprs_used is defined
3773 as the highest VGPR number
3774 explicitly referenced plus
3777 Used by CP to set up
3778 ``COMPUTE_PGM_RSRC1.VGPRS``.
3781 :ref:`amdgpu-assembler`
3783 automatically for the
3784 selected processor from
3785 values provided to the
3786 `.amdhsa_kernel` directive
3788 `.amdhsa_next_free_vgpr`
3789 nested directive (see
3790 :ref:`amdhsa-kernel-directives-table`).
3791 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
3792 blocks used by a wavefront;
3793 granularity is device
3798 - max(0, ceil(sgprs_used / 8) - 1)
3801 - 2 * max(0, ceil(sgprs_used / 16) - 1)
3803 Reserved, must be 0.
3808 defined as the highest
3809 SGPR number explicitly
3810 referenced plus one, plus
3811 a target specific number
3812 of additional special
3814 FLAT_SCRATCH (GFX7+) and
3815 XNACK_MASK (GFX8+), and
3818 limitations. It does not
3819 include the 16 SGPRs added
3820 if a trap handler is
3824 limitations and special
3825 SGPR layout are defined in
3827 documentation, which can
3829 :ref:`amdgpu-processors`
3832 Used by CP to set up
3833 ``COMPUTE_PGM_RSRC1.SGPRS``.
3836 :ref:`amdgpu-assembler`
3838 automatically for the
3839 selected processor from
3840 values provided to the
3841 `.amdhsa_kernel` directive
3843 `.amdhsa_next_free_sgpr`
3844 and `.amdhsa_reserve_*`
3845 nested directives (see
3846 :ref:`amdhsa-kernel-directives-table`).
3847 11:10 2 bits PRIORITY Must be 0.
3849 Start executing wavefront
3850 at the specified priority.
3852 CP is responsible for
3854 ``COMPUTE_PGM_RSRC1.PRIORITY``.
3855 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
3856 with specified rounding
3859 precision floating point
3862 Floating point rounding
3863 mode values are defined in
3864 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
3866 Used by CP to set up
3867 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3868 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
3869 with specified rounding
3870 mode for half/double (16
3871 and 64-bit) floating point
3872 precision floating point
3875 Floating point rounding
3876 mode values are defined in
3877 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
3879 Used by CP to set up
3880 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3881 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
3882 with specified denorm mode
3885 precision floating point
3888 Floating point denorm mode
3889 values are defined in
3890 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
3892 Used by CP to set up
3893 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3894 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
3895 with specified denorm mode
3897 and 64-bit) floating point
3898 precision floating point
3901 Floating point denorm mode
3902 values are defined in
3903 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
3905 Used by CP to set up
3906 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
3907 20 1 bit PRIV Must be 0.
3909 Start executing wavefront
3910 in privilege trap handler
3913 CP is responsible for
3915 ``COMPUTE_PGM_RSRC1.PRIV``.
3916 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
3917 with DX10 clamp mode
3918 enabled. Used by the vector
3919 ALU to force DX10 style
3920 treatment of NaNs (when
3921 set, clamp NaN to zero,
3925 Used by CP to set up
3926 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
3927 22 1 bit DEBUG_MODE Must be 0.
3929 Start executing wavefront
3930 in single step mode.
3932 CP is responsible for
3934 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
3935 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
3937 enabled. Floating point
3938 opcodes that support
3939 exception flag gathering
3940 will quiet and propagate
3941 signaling-NaN inputs per
3942 IEEE 754-2008. Min_dx10 and
3943 max_dx10 become IEEE
3944 754-2008 compliant due to
3945 signaling-NaN propagation
3948 Used by CP to set up
3949 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
3950 24 1 bit BULKY Must be 0.
3952 Only one work-group allowed
3953 to execute on a compute
3956 CP is responsible for
3958 ``COMPUTE_PGM_RSRC1.BULKY``.
3959 25 1 bit CDBG_USER Must be 0.
3961 Flag that can be used to
3962 control debugging code.
3964 CP is responsible for
3966 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
3967 26 1 bit FP16_OVFL GFX6-GFX8
3968 Reserved, must be 0.
3970 Wavefront starts execution
3971 with specified fp16 overflow
3974 - If 0, fp16 overflow generates
3976 - If 1, fp16 overflow that is the
3977 result of an +/-INF input value
3978 or divide by 0 produces a +/-INF,
3979 otherwise clamps computed
3980 overflow to +/-MAX_FP16 as
3983 Used by CP to set up
3984 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
3985 28:27 2 bits Reserved, must be 0.
3986 29 1 bit WGP_MODE GFX6-GFX9
3987 Reserved, must be 0.
3989 - If 0 execute work-groups in
3990 CU wavefront execution mode.
3991 - If 1 execute work-groups
3992 in WGP wavefront execution mode.
3994 See :ref:`amdgpu-amdhsa-memory-model`.
3996 Used by CP to set up
3997 ``COMPUTE_PGM_RSRC1.WGP_MODE``.
3998 30 1 bit MEM_ORDERED GFX6-GFX9
3999 Reserved, must be 0.
4001 Controls the behavior of the
4002 s_waitcnt's vmcnt and vscnt
4005 - If 0 vmcnt reports completion
4006 of load and atomic with return
4007 out of order with sample
4008 instructions, and the vscnt
4009 reports the completion of
4010 store and atomic without
4012 - If 1 vmcnt reports completion
4013 of load, atomic with return
4014 and sample instructions in
4015 order, and the vscnt reports
4016 the completion of store and
4017 atomic without return in order.
4019 Used by CP to set up
4020 ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
4021 31 1 bit FWD_PROGRESS GFX6-GFX9
4022 Reserved, must be 0.
4024 - If 0 execute SIMD wavefronts
4025 using oldest first policy.
4026 - If 1 execute SIMD wavefronts to
4027 ensure wavefronts will make some
4030 Used by CP to set up
4031 ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
4032 32 **Total size 4 bytes**
4033 ======= ===================================================================================================================
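
The granulated register counts in the table above all have the form
``max(0, ceil(used / granule) - 1)``. The following Python sketch evaluates
that expression with integer arithmetic and applies it to the two GFX10 cases
the table names (a granule of 4 for wavefront size 64 and 8 for wavefront size
32). The function names are hypothetical. For other targets the granule has to
be taken from the table and passed explicitly.

.. code-block:: python

  def granulated_count(registers_used, granule):
      """Evaluate max(0, ceil(registers_used / granule) - 1)."""
      blocks = (registers_used + granule - 1) // granule  # integer ceil
      return max(0, blocks - 1)


  def gfx10_granulated_vgpr_count(vgprs_used, wavefront_size):
      """GRANULATED_WORKITEM_VGPR_COUNT for GFX10, per the table above."""
      # Granule of 4 for wavefront size 64 and 8 for wavefront size 32.
      granule = {64: 4, 32: 8}[wavefront_size]
      value = granulated_count(vgprs_used, granule)
      if value > 0x3F:
          raise ValueError("value does not fit in the 6-bit field")
      return value

For example, a wavefront size 32 kernel whose highest referenced VGPR is v24
has ``vgprs_used = 25``, which occupies four blocks of 8 and so is encoded
as 3.
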
4037 .. table:: compute_pgm_rsrc2 for GFX6-GFX10
4038 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table
4040 ======= ======= =============================== ===========================================================================
4041 Bits Size Field Name Description
4042 ======= ======= =============================== ===========================================================================
4043 0 1 bit ENABLE_PRIVATE_SEGMENT * Enable the setup of the
4045 * If the *Target Properties*
4047 :ref:`amdgpu-processor-table`
4050 scratch* then enable the
4052 wavefront scratch offset
4053 system register (see
4054 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4055 * If the *Target Properties*
4057 :ref:`amdgpu-processor-table`
4058 specifies *Architected
4059 flat scratch* then enable
4061 FLAT_SCRATCH register
4063 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4065 Used by CP to set up
4066 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
4067 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
4069 requested. This number must
4070 match the number of user
4071 data registers enabled.
4073 Used by CP to set up
4074 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
4075 6 1 bit ENABLE_TRAP_HANDLER Must be 0.
4078 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
4079 which is set by the CP if
4080 the runtime has installed a
4082 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
4083 system SGPR register for
4084 the work-group id in the X
4086 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4088 Used by CP to set up
4089 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
4090 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
4091 system SGPR register for
4092 the work-group id in the Y
4094 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4096 Used by CP to set up
4097 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
4098 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
4099 system SGPR register for
4100 the work-group id in the Z
4102 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4104 Used by CP to set up
4105 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
4106 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
4107 system SGPR register for
4108 work-group information (see
4109 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4111 Used by CP to set up
4112 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
4113 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
4114 VGPR system registers used
4115 for the work-item ID.
4116 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
4119 Used by CP to set up
4120 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
4121 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
4123 Wavefront starts execution
4125 exceptions enabled which
4126 are generated when L1 has
4127 witnessed a thread access
4131 CP is responsible for
4132 filling in the address
4134 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4135 according to what the
4137 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
4139 Wavefront starts execution
4140 with memory violation
4141 exceptions
4142 enabled which are generated
4143 when a memory violation has
4144 occurred for this wavefront from
4146 (write-to-read-only-memory,
4147 mis-aligned atomic, LDS
4148 address out of range,
4149 illegal address, etc.).
4153 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
4154 according to what the
4156 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
4158 CP uses the rounded value
4159 from the dispatch packet,
4160 not this value, as the
4161 dispatch may contain
4162 dynamically allocated group
4163 segment memory. CP writes
4165 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
4167 Amount of group segment
4168 (LDS) to allocate for each
4169 work-group. Granularity is
4173 roundup(lds-size / (64 * 4))
4175 roundup(lds-size / (128 * 4))
4177 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
4178 _INVALID_OPERATION with specified exceptions
4181 Used by CP to set up
4182 ``COMPUTE_PGM_RSRC2.EXCP_EN``
4183 (set from bits 0..6).
4187 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
4188 _SOURCE input operands is a
4190 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
4191 _DIVISION_BY_ZERO Zero
4192 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Overflow
4194 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
4196 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
4198 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
4199 _ZERO (rcp_iflag_f32 instruction
4201 31 1 bit Reserved, must be 0.
4202 32 **Total size 4 bytes.**
4203 ======= ===================================================================================================================
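
As an illustration of the ``compute_pgm_rsrc2`` layout, the following Python
sketch packs and unpacks the fields in bits 23:0 of the table above. It is a
reading aid under stated assumptions rather than the backend's implementation:
the helper names ``encode_rsrc2`` and ``decode_rsrc2`` are hypothetical, and
the exception enable bits 30:24 are omitted for brevity.

.. code-block:: python

  # Field name -> (lowest bit, width in bits), from the table above.
  RSRC2_FIELDS = {
      "ENABLE_PRIVATE_SEGMENT": (0, 1),
      "USER_SGPR_COUNT": (1, 5),
      "ENABLE_TRAP_HANDLER": (6, 1),
      "ENABLE_SGPR_WORKGROUP_ID_X": (7, 1),
      "ENABLE_SGPR_WORKGROUP_ID_Y": (8, 1),
      "ENABLE_SGPR_WORKGROUP_ID_Z": (9, 1),
      "ENABLE_SGPR_WORKGROUP_INFO": (10, 1),
      "ENABLE_VGPR_WORKITEM_ID": (11, 2),
      "ENABLE_EXCEPTION_ADDRESS_WATCH": (13, 1),
      "ENABLE_EXCEPTION_MEMORY": (14, 1),
      "GRANULATED_LDS_SIZE": (15, 9),
  }


  def encode_rsrc2(**fields):
      """Pack the named fields into a 32-bit word; unnamed fields are 0."""
      value = 0
      for name, field_value in fields.items():
          low, width = RSRC2_FIELDS[name]
          if not 0 <= field_value < (1 << width):
              raise ValueError(f"{name} does not fit in {width} bit(s)")
          value |= field_value << low
      return value


  def decode_rsrc2(value):
      """Split a 32-bit word into the fields listed in RSRC2_FIELDS."""
      return {name: (value >> low) & ((1 << width) - 1)
              for name, (low, width) in RSRC2_FIELDS.items()}
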
4207 .. table:: compute_pgm_rsrc3 for GFX90A
4208 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table
4210 ======= ======= =============================== ===========================================================================
4211 Bits Size Field Name Description
4212 ======= ======= =============================== ===========================================================================
4213 5:0 6 bits ACCUM_OFFSET Offset of the first AccVGPR in the unified register file. Granularity 4.
4214 Value 0-63. 0 - accum-offset = 4, 1 - accum-offset = 8, ...,
4215 63 - accum-offset = 256.
4216 15:6 10 bits Reserved, must be 0.
4218 16 1 bit TG_SPLIT - If 0 the waves of a work-group are
4219 launched in the same CU.
4220 - If 1 the waves of a work-group can be
4221 launched in different CUs. The waves
4222 cannot use S_BARRIER or LDS.
4223 31:17 15 bits Reserved, must be 0.
4225 32 **Total size 4 bytes.**
4226 ======= ===================================================================================================================
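
The ``ACCUM_OFFSET`` encoding above (0 means 4, 1 means 8, up to 63 meaning
256) amounts to ``accum_offset / 4 - 1``. A minimal Python sketch of that
mapping, together with the ``TG_SPLIT`` bit, is given below. The function
names are hypothetical.

.. code-block:: python

  def encode_accum_offset(accum_offset):
      """Map an AccVGPR offset (4, 8, ..., 256) to the 6-bit field value."""
      if accum_offset % 4 != 0 or not 4 <= accum_offset <= 256:
          raise ValueError("accum_offset must be a multiple of 4 in [4, 256]")
      return accum_offset // 4 - 1


  def decode_accum_offset(field):
      """Inverse mapping: field value 0..63 to the AccVGPR offset."""
      if not 0 <= field <= 63:
          raise ValueError("field must be in [0, 63]")
      return (field + 1) * 4


  def encode_rsrc3_gfx90a(accum_offset, tg_split):
      """Pack ACCUM_OFFSET (bits 5:0) and TG_SPLIT (bit 16); the rest is 0."""
      return encode_accum_offset(accum_offset) | (int(bool(tg_split)) << 16)
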
4230 .. table:: compute_pgm_rsrc3 for GFX10
4231 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table
4233 ======= ======= =============================== ===========================================================================
4234 Bits Size Field Name Description
4235 ======= ======= =============================== ===========================================================================
4236 3:0 4 bits SHARED_VGPR_COUNT Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120.
4237 compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64.
4238 31:4 28 bits Reserved, must be 0.
4240 32 **Total size 4 bytes.**
4241 ======= ===================================================================================================================
4245 .. table:: Floating Point Rounding Mode Enumeration Values
4246 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
4248 ====================================== ===== ==============================
4249 Enumeration Name Value Description
4250 ====================================== ===== ==============================
4251 FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
4252 FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
4253 FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
4254 FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
4255 ====================================== ===== ==============================
4259 .. table:: Floating Point Denorm Mode Enumeration Values
4260 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
4262 ====================================== ===== ==============================
4263 Enumeration Name Value Description
4264 ====================================== ===== ==============================
4265 FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
4267 FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
4268 FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
4269 FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
4270 ====================================== ===== ==============================
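
The two enumerations above supply the values for the four 2-bit floating point
mode fields of ``compute_pgm_rsrc1`` (bits 13:12, 15:14, 17:16 and 19:18). The
following Python sketch shows one way to combine them. The class and function
names are hypothetical and only mirror the enumeration names in the tables.

.. code-block:: python

  from enum import IntEnum


  class FloatRoundMode(IntEnum):
      NEAR_EVEN = 0
      PLUS_INFINITY = 1
      MINUS_INFINITY = 2
      ZERO = 3


  class FloatDenormMode(IntEnum):
      FLUSH_SRC_DST = 0
      FLUSH_DST = 1
      FLUSH_SRC = 2
      FLUSH_NONE = 3


  def pack_float_mode_fields(round_32, round_16_64, denorm_32, denorm_16_64):
      """Place the four mode fields at bits 19:12 of compute_pgm_rsrc1."""
      return ((int(round_32) << 12) | (int(round_16_64) << 14)
              | (int(denorm_32) << 16) | (int(denorm_16_64) << 18))

For instance, round-to-nearest-even everywhere, flushing 32-bit denorms and
preserving 16-bit and 64-bit denorms, sets only bits 19:18.
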
4274 .. table:: System VGPR Work-Item ID Enumeration Values
4275 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
4277 ======================================== ===== ============================
4278 Enumeration Name Value Description
4279 ======================================== ===== ============================
4280 SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
4282 SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
4284 SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
4286 SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
4287 ======================================== ===== ============================
4289 .. _amdgpu-amdhsa-initial-kernel-execution-state:
4291 Initial Kernel Execution State
4292 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4294 This section defines the register state that will be set up by the packet
4295 processor prior to the start of execution of every wavefront. This is limited by
4296 the constraints of the hardware controllers of CP/ADC/SPI.
4298 The order of the SGPR registers is defined, but the compiler can specify which
4299 ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
4300 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4301 for enabled registers are dense starting at SGPR0: the first enabled register is
4302 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
4305 The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
4306 all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
4307 using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
4308 actually initialized. These are then immediately followed by the System SGPRs
4309 that are set up by ADC/SPI and can have different values for each wavefront of
4312 SGPR register initial state is defined in
4313 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
4315 .. table:: SGPR Register Set Up Order
4316 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
4318 ========== ========================== ====== ==============================
4319 SGPR Order Name Number Description
4320 (kernel descriptor enable of
4322 ========== ========================== ====== ==============================
4323 First Private Segment Buffer 4 See
4324 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4326 then Dispatch Ptr 2 64-bit address of AQL dispatch
4327 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
4329 then Queue Ptr 2 64-bit address of amd_queue_t
4330 (enable_sgpr_queue_ptr) object for AQL queue on which
4331 the dispatch packet was
4333 then Kernarg Segment Ptr 2 64-bit address of Kernarg
4334 (enable_sgpr_kernarg segment. This is directly
4335 _segment_ptr) copied from the
4336 kernarg_address in the kernel
4339 Having CP load it once avoids
4340 loading it at the beginning of
4342 then Dispatch Id 2 64-bit Dispatch ID of the
4343 (enable_sgpr_dispatch_id) dispatch packet being
4345 then Flat Scratch Init 2 See
4346 (enable_sgpr_flat_scratch :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4348 then Private Segment Size 1 The 32-bit byte size of a
4349 (enable_sgpr_private single work-item's memory
4350 _segment_size) allocation. This is the
4351 value from the kernel
4352 dispatch packet Private
4353 Segment Byte Size rounded up
4354 by CP to a multiple of
4357 Having CP load it once avoids
4358 loading it at the beginning of
4361 This is not used for
4362 GFX7-GFX8 since it is the same
4363 value as the second SGPR of
4364 Flat Scratch Init. However, it
4365 may be needed for GFX9-GFX10 which
4366 change the meaning of the
4367 Flat Scratch Init value.
4368 then Work-Group Id X 1 32-bit work-group id in X
4369 (enable_sgpr_workgroup_id dimension of grid for
4371 then Work-Group Id Y 1 32-bit work-group id in Y
4372 (enable_sgpr_workgroup_id dimension of grid for
4374 then Work-Group Id Z 1 32-bit work-group id in Z
4375 (enable_sgpr_workgroup_id dimension of grid for
4377 then Work-Group Info 1 {first_wavefront, 14'b0000,
4378 (enable_sgpr_workgroup ordered_append_term[10:0],
4379 _info) threadgroup_size_in_wavefronts[5:0]}
4380 then Scratch Wavefront Offset 1 See
4381 (enable_sgpr_private :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4382 _segment_wavefront_offset) and
4383 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`.
4384 ========== ========================== ====== ==============================
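
Because enabled registers are numbered densely from SGPR0 in the fixed order of
the table above, the register assigned to each value follows from the set of
enabled entries alone. The Python sketch below computes that assignment. The
short names and the function ``assign_initial_sgprs`` are hypothetical, and the
sketch does not model the 16 User SGPR limit.

.. code-block:: python

  # (name, number of SGPRs), in the set up order given by the table above.
  SGPR_ORDER = [
      ("private_segment_buffer", 4),
      ("dispatch_ptr", 2),
      ("queue_ptr", 2),
      ("kernarg_segment_ptr", 2),
      ("dispatch_id", 2),
      ("flat_scratch_init", 2),
      ("private_segment_size", 1),
      ("workgroup_id_x", 1),
      ("workgroup_id_y", 1),
      ("workgroup_id_z", 1),
      ("workgroup_info", 1),
      ("private_segment_wavefront_offset", 1),
  ]


  def assign_initial_sgprs(enabled):
      """Return {name: (first SGPR, count)} for the enabled registers."""
      unknown = set(enabled) - {name for name, _ in SGPR_ORDER}
      if unknown:
          raise ValueError(f"unknown register names: {sorted(unknown)}")
      layout = {}
      next_sgpr = 0
      for name, count in SGPR_ORDER:
          if name in enabled:
              layout[name] = (next_sgpr, count)
              next_sgpr += count  # disabled registers take no SGPR number
      return layout

With only the Private Segment Buffer, Kernarg Segment Ptr, and Work-Group Id X
and Z enabled, the sketch places them in SGPR0-3, SGPR4-5, SGPR6 and SGPR7
respectively.
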
4386 The order of the VGPR registers is defined, but the compiler can specify which
4387 ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
4388 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
4389 for enabled registers are dense starting at VGPR0: the first enabled register is
4390 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
4393 There are different methods used for the VGPR initial state:
4395 * Unless the *Target Properties* column of :ref:`amdgpu-processor-table`
4396 specifies otherwise, a separate VGPR register is used per work-item ID. The
4397 VGPR register initial state for this method is defined in
4398 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table`.
4399 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4400 specifies *Packed work-item IDs*, the initial value of VGPR0 register is used
4401 for all work-item IDs. The register layout for this method is defined in
4402 :ref:`amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table`.
4404 .. table:: VGPR Register Set Up Order for Unpacked Work-Item ID Method
4405 :name: amdgpu-amdhsa-vgpr-register-set-up-order-for-unpacked-work-item-id-method-table
4407 ========== ========================== ====== ==============================
4408 VGPR Order Name Number Description
4409 (kernel descriptor enable of
4411 ========== ========================== ====== ==============================
4412 First Work-Item Id X 1 32-bit work-item id in X
4413 (Always initialized) dimension of work-group for
4415 then Work-Item Id Y 1 32-bit work-item id in Y
4416 (enable_vgpr_workitem_id dimension of work-group for
4417 > 0) wavefront lane.
4418 then Work-Item Id Z 1 32-bit work-item id in Z
4419 (enable_vgpr_workitem_id dimension of work-group for
4420 > 1) wavefront lane.
4421 ========== ========================== ====== ==============================
4425 .. table:: Register Layout for Packed Work-Item ID Method
4426 :name: amdgpu-amdhsa-register-layout-for-packed-work-item-id-method-table
4428 ======= ======= ================ =========================================
4429 Bits Size Field Name Description
4430 ======= ======= ================ =========================================
4431 0:9 10 bits Work-Item Id X Work-item id in X
4432 dimension of work-group for
4437 10:19 10 bits Work-Item Id Y Work-item id in Y
4438 dimension of work-group for
4441 Initialized if enable_vgpr_workitem_id >
4442 0, otherwise set to 0.
4443 20:29 10 bits Work-Item Id Z Work-item id in Z
4444 dimension of work-group for
4447 Initialized if enable_vgpr_workitem_id >
4448 1, otherwise set to 0.
4449 30:31 2 bits Reserved, set to 0.
4450 ======= ======= ================ =========================================
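
With the packed method, a kernel recovers the three work-item IDs from VGPR0
by shifting and masking the 10-bit fields listed above. A minimal Python sketch
of the arithmetic (the function names are hypothetical):

.. code-block:: python

  FIELD_MASK = 0x3FF  # each work-item ID field is 10 bits wide


  def unpack_work_item_ids(vgpr0):
      """Return (x, y, z) from the packed VGPR0 value."""
      x = vgpr0 & FIELD_MASK
      y = (vgpr0 >> 10) & FIELD_MASK
      z = (vgpr0 >> 20) & FIELD_MASK
      return x, y, z


  def pack_work_item_ids(x, y, z):
      """Inverse of unpack_work_item_ids; bits 30 and 31 stay 0."""
      for value in (x, y, z):
          if not 0 <= value <= FIELD_MASK:
              raise ValueError("each ID must fit in 10 bits")
      return x | (y << 10) | (z << 20)
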
4452 The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
4454 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
4456 2. Work-group Id registers X, Y, Z are set by ADC which supports any
4457 combination including none.
4458 3. Scratch Wavefront Offset is set by SPI on a per-wavefront basis which is why
4459 its value cannot be included with the flat scratch init value which is per
4460 queue (see :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`).
4461 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
4463 5. Flat Scratch register pair initialization is described in
4464 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
4466 The global segment can be accessed either using buffer instructions (GFX6 which
4467 has V# 64-bit address support), flat instructions (GFX7-GFX10), or global
4468 instructions (GFX9-GFX10).
4470 If buffer operations are used, then the compiler can generate a V# with the
4471 following properties:
4475 * ATC: 1 if IOMMU present (such as APU)
4477 * MTYPE set to support memory coherence that matches the runtime (such as CC for
4478 APU and NC for dGPU).
4480 .. _amdgpu-amdhsa-kernel-prolog:
4485 The compiler performs initialization in the kernel prologue depending on the
4486 target and information about things like stack usage in the kernel and called
4487 functions. Some of this initialization requires the compiler to request certain
4488 User and System SGPRs be present in the
4489 :ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the
4490 :ref:`amdgpu-amdhsa-kernel-descriptor`.
4492 .. _amdgpu-amdhsa-kernel-prolog-cfi:
4497 1. The CFI return address is undefined.
4499 2. The CFI CFA is defined using an expression which evaluates to a location
4500 description that comprises one memory location description for the
4501 ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``.
4503 .. _amdgpu-amdhsa-kernel-prolog-m0:
4509 The M0 register must be initialized with a value at least the total LDS size
4510 if the kernel may access LDS via DS or flat operations. Total LDS size is
4511 available in the dispatch packet. For M0, it is also possible to use the maximum
4512 possible value of LDS for the given target (0x7FFF for GFX6 and 0xFFFF for
4515 The M0 register is not used for range checking LDS accesses and so does not
4516 need to be initialized in the prolog.
4518 .. _amdgpu-amdhsa-kernel-prolog-stack-pointer:
4523 If the kernel has function calls it must set up the ABI stack pointer described
4524 in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting
4525 SGPR32 to the unswizzled scratch offset of the address past the last local
4528 .. _amdgpu-amdhsa-kernel-prolog-frame-pointer:
4533 If the kernel needs a frame pointer for the reasons defined in
4534 ``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the
4535 kernel prolog. If a frame pointer is not required then all uses of the frame
4536 pointer are replaced with immediate ``0`` offsets.
4538 .. _amdgpu-amdhsa-kernel-prolog-flat-scratch:
4543 There are different methods used for initializing flat scratch:
4545 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4546 specifies *Does not support generic address space*:
4548 Flat scratch is not supported and there is no flat scratch register pair.
4550 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4551 specifies *Offset flat scratch*:
4553 If the kernel or any function it calls may use flat operations to access
4554 scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4555 (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI). Initialization uses Flat Scratch Init and
4556 Scratch Wavefront Offset SGPR registers (see
4557 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4559 1. The low word of Flat Scratch Init is the 32-bit byte offset from
4560 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
4561 being managed by SPI for the queue executing the kernel dispatch. This is
4562 the same value used in the Scratch Segment Buffer V# base address.
4564 CP obtains this from the runtime. (The Scratch Segment Buffer base address
4565 is ``SH_HIDDEN_PRIVATE_BASE_VIMID`` plus this offset.)
4567 The prolog must add the value of Scratch Wavefront Offset to get the
4568 wavefront's byte scratch backing memory offset from
4569 ``SH_HIDDEN_PRIVATE_BASE_VIMID``.
4571 The Scratch Wavefront Offset must also be used as an offset with Private
4572 segment address when using the Scratch Segment Buffer.
4574 Since FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right
4575 shifted by 8 before moving into FLAT_SCRATCH_HI.
4577 FLAT_SCRATCH_HI corresponds to SGPRn-4 on GFX7, and SGPRn-6 on GFX8 (where
4578 SGPRn is the highest numbered SGPR allocated to the wavefront).
4579 FLAT_SCRATCH_HI is multiplied by 256 (as it is in units of 256 bytes) and
4580 added to ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to calculate the per wavefront
4581 FLAT SCRATCH BASE in flat memory instructions that access the scratch
4583 2. The second word of Flat Scratch Init is the 32-bit byte size of a single
4584 work-item's scratch memory usage.
4586 CP obtains this from the runtime, and it is always a multiple of DWORD. CP
4587 checks that the value in the kernel dispatch packet Private Segment Byte
4588 Size is not larger and requests the runtime to increase the queue's scratch
4591 CP directly loads from the kernel dispatch packet Private Segment Byte Size
4592 field and rounds up to a multiple of DWORD. Having CP load it once avoids
4593 loading it at the beginning of every wavefront.
4595 The kernel prolog code must move it to FLAT_SCRATCH_LO which is SGPRn-3 on
4596 GFX7 and SGPRn-5 on GFX8. FLAT_SCRATCH_LO is used as the FLAT SCRATCH SIZE
4597 in flat memory instructions.
4599 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4600 specifies *Absolute flat scratch*:
4602 If the kernel or any function it calls may use flat operations to access
4603 scratch memory, the prolog code must set up the FLAT_SCRATCH register pair
4604 (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization
4605 uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see
4606 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
4608 The Flat Scratch Init is the 64-bit address of the base of scratch backing
4609 memory being managed by SPI for the queue executing the kernel dispatch.
4611 CP obtains this from the runtime.
4613 The kernel prolog must add the value of the wave's Scratch Wavefront Offset
4614 and move the result as a 64-bit value to the FLAT_SCRATCH SGPR register pair
4615 which is SGPRn-6 and SGPRn-5. It is used as the FLAT SCRATCH BASE in flat
4616 memory instructions.
4618 The Scratch Wavefront Offset must also be used as an offset with Private
4619 segment address when using the Scratch Segment Buffer (see
4620 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`).
4622 * If the *Target Properties* column of :ref:`amdgpu-processor-table`
4623 specifies *Architected flat scratch*:
4625 If ENABLE_PRIVATE_SEGMENT is enabled in
4626 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table` then the FLAT_SCRATCH
4627 register pair will be initialized to the 64-bit address of the base of scratch
4628 backing memory being managed by SPI for the queue executing the kernel
4629 dispatch plus the value of the wave's Scratch Wavefront Offset for use as the
4630 flat scratch base in flat memory instructions.
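
The arithmetic for the *Offset flat scratch* and *Absolute flat scratch*
methods described above can be sketched in Python as follows. This models only
the values the prolog computes, not the instructions it emits. The function
names are hypothetical, and the wrap-around to 32 and 64 bits is an assumption
about how the additions behave.

.. code-block:: python

  MASK32 = 0xFFFFFFFF
  MASK64 = 0xFFFFFFFFFFFFFFFF


  def offset_flat_scratch(init_lo, init_hi, wavefront_offset):
      """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for the offset method."""
      # Byte offset of this wavefront's scratch from the private base.
      wave_byte_offset = (init_lo + wavefront_offset) & MASK32
      flat_scratch_hi = wave_byte_offset >> 8  # units of 256 bytes
      flat_scratch_lo = init_hi                # per work-item size in bytes
      return flat_scratch_lo, flat_scratch_hi


  def absolute_flat_scratch(init_lo, init_hi, wavefront_offset):
      """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for the absolute method."""
      # Flat Scratch Init is a 64-bit address; add the wavefront's offset.
      base = (((init_hi << 32) | init_lo) + wavefront_offset) & MASK64
      return base & MASK32, base >> 32

In the offset method the two halves carry different quantities (a size and a
scaled offset), whereas in the absolute method the pair holds a single 64-bit
address.
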
4632 .. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer:
4634 Private Segment Buffer
4635 ++++++++++++++++++++++
4637 If the *Target Properties* column of :ref:`amdgpu-processor-table` specifies
4638 *Architected flat scratch* then a Private Segment Buffer is not supported.
4639 Instead the flat SCRATCH instructions are used.
4641 Otherwise, the Private Segment Buffer SGPR register is used to initialize 4 SGPRs
4642 that are used as a V# to access scratch. CP uses the value provided by the
4643 runtime. It is used, together with Scratch Wavefront Offset as an offset, to
4644 access the private memory space using a segment address. See
4645 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
4647 The scratch V# is a four-aligned SGPR and always selected for the kernel as
4650 - If it is known during instruction selection that there is stack usage,
4651 SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if
4652 optimizations are disabled (``-O0``), if stack objects already exist (for
4653 locals, etc.), or if there are any function calls.
4655 - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index
4656 are reserved for the tentative scratch V#. These will be used if it is
4657 determined that spilling is needed.
4659 - If no use is made of the tentative scratch V#, then it is unreserved,
4660 and the register count is determined ignoring it.
4661 - If use is made of the tentative scratch V#, then its register numbers
4662 are shifted to the first four-aligned SGPR index after the highest one
4663 allocated by the register allocator, and all uses are updated. The
4664 register count includes them in the shifted location.
4665 - In either case, if the processor has the SGPR allocation bug, the
4666 tentative allocation is not shifted or unreserved in order to ensure
4667 the register count is higher to work around the bug.
4671 This approach of using a tentative scratch V# and shifting the register
4672 numbers if used avoids having to perform register allocation a second
4673 time if the tentative V# is eliminated. This is more efficient and
4674 avoids the problem that the second register allocation may perform
4675 spilling which will fail as there is no longer a scratch V#.
4677 When the kernel prolog code is being emitted it is known whether the scratch V#
4678 described above is actually used. If it is, the prolog code must set it up by
4679 copying the Private Segment Buffer to the scratch V# registers and then adding
4680 the Private Segment Wavefront Offset to the queue base address in the V#. The
4681 result is a V# with a base address pointing to the beginning of the wavefront
4682 scratch backing memory.
4684 The Private Segment Buffer is always requested, but the Private Segment
4685 Wavefront Offset is only requested if it is used (see
4686 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
4688 .. _amdgpu-amdhsa-memory-model:
4693 This section describes the mapping of the LLVM memory model onto AMDGPU machine
4694 code (see :ref:`memmodel`).
4696 The AMDGPU backend supports the memory synchronization scopes specified in
4697 :ref:`amdgpu-memory-scopes`.
4699 The code sequences used to implement the memory model specify the order of
4700 instructions that a single thread must execute. The ``s_waitcnt`` and cache
4701 management instructions such as ``buffer_wbinvl1_vol`` are defined with respect
4702 to other memory instructions executed by the same thread. This allows them to be
4703 moved earlier or later which can allow them to be combined with other instances
4704 of the same instruction, or hoisted/sunk out of loops to improve performance.
4705 Only the instructions related to the memory model are given; additional
4706 ``s_waitcnt`` instructions are required to ensure registers are defined before
4707 being used. These may be able to be combined with the memory model ``s_waitcnt``
4708 instructions as described above.
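As a rough illustration of combining ``s_waitcnt`` instructions, two waits can be merged by taking the stricter (smaller) value of each counter. The dictionary representation and function names below are hypothetical and only model the idea.

```python
def combine_waitcnt(first, second):
    """Merge two waits, given as {counter_name: max_outstanding} dicts.

    A counter present in only one wait is copied through; a counter present
    in both takes the smaller value, which satisfies both waits.
    """
    merged = dict(first)
    for counter, value in second.items():
        merged[counter] = min(value, merged.get(counter, value))
    return merged


def format_waitcnt(wait):
    """Render a wait in the notation used by the code sequence tables."""
    fields = " & ".join(f"{name}({value})" for name, value in sorted(wait.items()))
    return f"s_waitcnt {fields}"
```

For instance, merging a memory model ``vmcnt(0)`` wait with a register dependency ``lgkmcnt(0)`` wait gives a single ``s_waitcnt lgkmcnt(0) & vmcnt(0)``.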
4710 The AMDGPU backend supports the following memory models:
4712 HSA Memory Model [HSA]_
4713 The HSA memory model uses a single happens-before relation for all address
4714 spaces (see :ref:`amdgpu-address-spaces`).
4715 OpenCL Memory Model [OpenCL]_
4716 The OpenCL memory model which has separate happens-before relations for the
4717 global and local address spaces. Only a fence specifying both global and
4718 local address space, and seq_cst instructions join the relationships. Since
4719 the LLVM ``fence`` instruction does not allow an address space to be
4720 specified the OpenCL fence has to conservatively assume both local and
4721 global address space was specified. However, optimizations can often be
4722 done to eliminate the additional ``s_waitcnt`` instructions when there are
4723 no intervening memory instructions which access the corresponding address
4724 space. The code sequences in the table indicate what can be omitted for the
4725 OpenCL memory model. The target triple environment is used to determine if the
4726 source language is OpenCL (see :ref:`amdgpu-opencl`).
4728 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS operations.
4731 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
4732 termed vector memory operations.
4734 Private address space uses ``buffer_load/store`` using the scratch V#
4735 (GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread
4736 is accessing the memory, atomic memory orderings are not meaningful, and all
4737 accesses are treated as non-atomic.
4739 Constant address space uses ``buffer/global_load`` instructions (or equivalent
4740 scalar memory instructions). Since the constant address space contents do not
4741 change during the execution of a kernel dispatch it is not legal to perform
4742 stores, and atomic memory orderings are not meaningful, and all accesses are
4743 treated as non-atomic.
4745 A memory synchronization scope wider than work-group is not meaningful for the
4746 group (LDS) address space and is treated as work-group.
4748 The memory model does not support the region address space which is treated as
4751 Acquire memory ordering is not meaningful on store atomic instructions and is
4752 treated as non-atomic.
4754 Release memory ordering is not meaningful on load atomic instructions and is
4755 treated as non-atomic.
4757 Acquire-release memory ordering is not meaningful on load or store atomic
4758 instructions and is treated as acquire and release respectively.
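The normalization rules in the preceding paragraphs can be summarized by a small helper. This is a sketch for illustration; the function name, the string spellings, and the ``"non-atomic"`` marker are assumptions, and the region address space is not modelled.

```python
SCOPES = ["singlethread", "wavefront", "workgroup", "agent", "system"]


def normalize(instr, ordering, scope, address_space):
    """Return the (ordering, scope) actually implemented for an atomic.

    instr is one of "load atomic", "store atomic", "atomicrmw".
    """
    # Private and constant accesses are always treated as non-atomic.
    if address_space in ("private", "constant"):
        return ("non-atomic", scope)
    # Scopes wider than work-group are not meaningful for LDS.
    if address_space == "local" and SCOPES.index(scope) > SCOPES.index("workgroup"):
        scope = "workgroup"
    if instr == "store atomic" and ordering == "acquire":
        ordering = "non-atomic"
    elif instr == "load atomic" and ordering == "release":
        ordering = "non-atomic"
    elif ordering == "acq_rel" and instr == "load atomic":
        ordering = "acquire"
    elif ordering == "acq_rel" and instr == "store atomic":
        ordering = "release"
    return (ordering, scope)
```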
4760 The memory order also adds the single thread optimization constraints defined in
4762 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table`.
4764 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints
4765 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-table
4767 ============ ==============================================================
4768 LLVM Memory Optimization Constraints
4770 ============ ==============================================================
4773 acquire - If a load atomic/atomicrmw then no following load/load
4774 atomic/store/store atomic/atomicrmw/fence instruction can be
4775 moved before the acquire.
4776 - If a fence then same as load atomic, plus no preceding
4777 associated fence-paired-atomic can be moved after the fence.
4778 release - If a store atomic/atomicrmw then no preceding load/load
4779 atomic/store/store atomic/atomicrmw/fence instruction can be
4780 moved after the release.
4781 - If a fence then same as store atomic, plus no following
4782 associated fence-paired-atomic can be moved before the
4784 acq_rel Same constraints as both acquire and release.
4785 seq_cst - If a load atomic then same constraints as acquire, plus no
4786 preceding sequentially consistent load atomic/store
4787 atomic/atomicrmw/fence instruction can be moved after the
4789 - If a store atomic then the same constraints as release, plus
4790 no following sequentially consistent load atomic/store
4791 atomic/atomicrmw/fence instruction can be moved before the
4793 - If an atomicrmw/fence then same constraints as acq_rel.
4794 ============ ==============================================================
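The direction of each constraint in the table can be captured as a set of tags. The sketch below is a simplification: the tag names are invented for illustration, and the extra fence-paired-atomic rules for fences are deliberately not modelled.

```python
# Hypothetical tags for the two basic reordering restrictions.
ACQUIRE = frozenset({"following_not_moved_before"})
RELEASE = frozenset({"preceding_not_moved_after"})


def optimization_constraints(instr, ordering):
    """Return the set of reordering restrictions for an instruction."""
    if ordering == "acquire":
        return set(ACQUIRE)
    if ordering == "release":
        return set(RELEASE)
    if ordering == "acq_rel":
        return set(ACQUIRE | RELEASE)
    if ordering == "seq_cst":
        if instr == "load atomic":
            return set(ACQUIRE) | {"preceding_seq_cst_not_moved_after"}
        if instr == "store atomic":
            return set(RELEASE) | {"following_seq_cst_not_moved_before"}
        return set(ACQUIRE | RELEASE)  # atomicrmw and fence: same as acq_rel
    return set()  # unordered and monotonic add no constraints in this model
```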
4796 The code sequences used to implement the memory model are defined in the
4799 * :ref:`amdgpu-amdhsa-memory-model-gfx6-gfx9`
4800 * :ref:`amdgpu-amdhsa-memory-model-gfx90a`
4801 * :ref:`amdgpu-amdhsa-memory-model-gfx10`
4803 .. _amdgpu-amdhsa-memory-model-gfx6-gfx9:
4805 Memory Model GFX6-GFX9
4806 ++++++++++++++++++++++
4810 * Each agent has multiple shader arrays (SA).
4811 * Each SA has multiple compute units (CU).
4812 * Each CU has multiple SIMDs that execute wavefronts.
4813 * The wavefronts for a single work-group are executed in the same CU but may be
4814 executed by different SIMDs.
4815 * Each CU has a single LDS memory shared by the wavefronts of the work-groups executing on it.
4817 * All LDS operations of a CU are performed as wavefront wide operations in a
4818 global order and involve no caching. Completion is reported to a wavefront in execution order.
4820 * The LDS memory has multiple request queues shared by the SIMDs of a
4821 CU. Therefore, the LDS operations performed by different wavefronts of a
4822 work-group can be reordered relative to each other, which can result in
4823 reordering the visibility of vector memory operations with respect to LDS
4824 operations of other wavefronts in the same work-group. A ``s_waitcnt
4825 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
4826 vector memory operations between wavefronts of a work-group, but not between
4827 operations performed by the same wavefront.
4828 * The vector memory operations are performed as wavefront wide operations and
4829 completion is reported to a wavefront in execution order. The exception is
4830 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
4831 vector memory order if they access LDS memory, and out of LDS operation order
4832 if they access global memory.
4833 * The vector memory operations access a single vector L1 cache shared by all
4834 SIMDs of a CU. Therefore, no special action is required for coherence between the
4835 lanes of a single wavefront, or for coherence between wavefronts in the same
4836 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
4837 wavefronts executing in different work-groups as they may be executing on different CUs.
4839 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
4840 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
4841 scalar operations are used in a restricted way so do not impact the memory
4842 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
4843 * The vector and scalar memory operations use an L2 cache shared by all CUs on
4845 * The L2 cache has independent channels to service disjoint ranges of virtual
4847 * Each CU has a separate request queue per channel. Therefore, the vector and
4848 scalar memory operations performed by wavefronts executing in different
4849 work-groups (which may be executing on different CUs) of an agent can be
4850 reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
4851 ensure synchronization between vector memory operations of different CUs. It
4852 ensures a previous vector memory operation has completed before executing a
4853 subsequent vector memory or LDS operation and so can be used to meet the
4854 requirements of acquire and release.
4855 * The L2 cache can be kept coherent with other agents on some targets, or ranges
4856 of virtual addresses can be set up to bypass it to ensure system coherence.
4858 Scalar memory operations are only used to access memory that is proven to not
4859 change during the execution of the kernel dispatch. This includes constant
4860 address space and global address space for program scope ``const`` variables.
4861 Therefore, the kernel machine code does not have to maintain the scalar cache to
4862 ensure it is coherent with the vector caches. The scalar and vector caches are
4863 invalidated between kernel dispatches by CP since constant address space data
4864 may change between kernel dispatch executions. See
4865 :ref:`amdgpu-amdhsa-memory-spaces`.
4867 The one exception is if scalar writes are used to spill SGPR registers. In this
4868 case the AMDGPU backend ensures the memory location used to spill is never
4869 accessed by vector memory operations at the same time. If scalar writes are used
4870 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
4871 return since the locations may be used for vector memory instructions by a
4872 future wavefront that uses the same scratch area, or a function call that
4873 creates a frame at the same address, respectively. There is no need for a
4874 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
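The write-back placement described above can be illustrated on a list of instruction strings. This is a toy model, not the backend pass: the function name is hypothetical, and treating ``s_setpc_b64`` as the function return instruction is an assumption made for the example.

```python
def insert_scalar_writebacks(instructions, uses_scalar_stores,
                             return_opcode="s_setpc_b64"):
    """Insert s_dcache_wb before s_endpgm and before function returns.

    instructions is a list of non-empty assembly strings. Nothing is
    inserted unless scalar writes were used for SGPR spilling.
    """
    if not uses_scalar_stores:
        return list(instructions)
    result = []
    for inst in instructions:
        opcode = inst.split(maxsplit=1)[0]
        if opcode in ("s_endpgm", return_opcode):
            result.append("s_dcache_wb")
        result.append(inst)
    return result
```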
4876 For kernarg backing memory:
4878 * CP invalidates the L1 cache at the start of each kernel dispatch.
4879 * On dGPU the kernarg backing memory is allocated in host memory accessed as
4880 MTYPE UC (uncached) to avoid needing to invalidate the L2 cache. This also
4881 causes it to be treated as non-volatile and so is not invalidated by ``*_vol``.
4883 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent)
4884 and so the L2 cache will be coherent with the CPU and other agents.
4886 Scratch backing memory (which is used for the private address space) is accessed
4887 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
4888 only accessed by a single thread, and is always write-before-read, there is
4889 never a need to invalidate these entries from the L1 cache. Hence all cache
4890 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
4892 The code sequences used to implement the memory model for GFX6-GFX9 are defined
4893 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
4895 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
4896 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
4898 ============ ============ ============== ========== ================================
4899 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
4900 Ordering Sync Scope Address GFX6-GFX9
4902 ============ ============ ============== ========== ================================
4904 ------------------------------------------------------------------------------------
4905 load *none* *none* - global - !volatile & !nontemporal
4907 - private 1. buffer/global/flat_load
4909 - !volatile & nontemporal
4911 1. buffer/global/flat_load
4916 1. buffer/global/flat_load
4918 2. s_waitcnt vmcnt(0)
4920 - Must happen before
4921 any following volatile
4932 load *none* *none* - local 1. ds_load
4933 store *none* *none* - global - !volatile & !nontemporal
4935 - private 1. buffer/global/flat_store
4937 - !volatile & nontemporal
4939 1. buffer/global/flat_store
4944 1. buffer/global/flat_store
4945 2. s_waitcnt vmcnt(0)
4947 - Must happen before
4948 any following volatile
4959 store *none* *none* - local 1. ds_store
4960 **Unordered Atomic**
4961 ------------------------------------------------------------------------------------
4962 load atomic unordered *any* *any* *Same as non-atomic*.
4963 store atomic unordered *any* *any* *Same as non-atomic*.
4964 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
4965 **Monotonic Atomic**
4966 ------------------------------------------------------------------------------------
4967 load atomic monotonic - singlethread - global 1. buffer/global/ds/flat_load
4969 - workgroup - generic
4970 load atomic monotonic - agent - global 1. buffer/global/flat_load
4971 - system - generic glc=1
4972 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
4973 - wavefront - generic
4977 store atomic monotonic - singlethread - local 1. ds_store
4980 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
4981 - wavefront - generic
4985 atomicrmw monotonic - singlethread - local 1. ds_atomic
4989 ------------------------------------------------------------------------------------
4990 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
4993 load atomic acquire - workgroup - global 1. buffer/global_load
4994 load atomic acquire - workgroup - local 1. ds/flat_load
4995 - generic 2. s_waitcnt lgkmcnt(0)
4998 - Must happen before
5007 older than a local load
5011 load atomic acquire - agent - global 1. buffer/global_load
5013 2. s_waitcnt vmcnt(0)
5015 - Must happen before
5023 3. buffer_wbinvl1_vol
5025 - Must happen before
5035 load atomic acquire - agent - generic 1. flat_load glc=1
5036 - system 2. s_waitcnt vmcnt(0) &
5041 - Must happen before
5044 - Ensures the flat_load
5049 3. buffer_wbinvl1_vol
5051 - Must happen before
5061 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
5064 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
5065 atomicrmw acquire - workgroup - local 1. ds/flat_atomic
5066 - generic 2. s_waitcnt lgkmcnt(0)
5069 - Must happen before
5082 atomicrmw acquire - agent - global 1. buffer/global_atomic
5083 - system 2. s_waitcnt vmcnt(0)
5085 - Must happen before
5094 3. buffer_wbinvl1_vol
5096 - Must happen before
5106 atomicrmw acquire - agent - generic 1. flat_atomic
5107 - system 2. s_waitcnt vmcnt(0) &
5112 - Must happen before
5121 3. buffer_wbinvl1_vol
5123 - Must happen before
5133 fence acquire - singlethread *none* *none*
5135 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
5140 - However, since LLVM
5165 fence-paired-atomic).
5166 - Must happen before
5177 fence-paired-atomic.
5179 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
5186 - However, since LLVM
5194 - Could be split into
5203 - s_waitcnt vmcnt(0)
5214 fence-paired-atomic).
5215 - s_waitcnt lgkmcnt(0)
5226 fence-paired-atomic).
5227 - Must happen before
5241 fence-paired-atomic.
5243 2. buffer_wbinvl1_vol
5245 - Must happen before any
5246 following global/generic
5256 ------------------------------------------------------------------------------------
5257 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
5260 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
5269 - Must happen before
5280 2. buffer/global/flat_store
5281 store atomic release - workgroup - local 1. ds_store
5282 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
5283 - system - generic vmcnt(0)
5289 - Could be split into
5298 - s_waitcnt vmcnt(0)
5305 - s_waitcnt lgkmcnt(0)
5312 - Must happen before
5323 2. buffer/global/flat_store
5324 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
5327 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
5336 - Must happen before
5347 2. buffer/global/flat_atomic
5348 atomicrmw release - workgroup - local 1. ds_atomic
5349 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
5350 - system - generic vmcnt(0)
5354 - Could be split into
5363 - s_waitcnt vmcnt(0)
5370 - s_waitcnt lgkmcnt(0)
5377 - Must happen before
5388 2. buffer/global/flat_atomic
5389 fence release - singlethread *none* *none*
5391 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
5396 - However, since LLVM
5417 - Must happen before
5426 fence-paired-atomic).
5433 fence-paired-atomic.
5435 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
5446 - However, since LLVM
5461 - Could be split into
5470 - s_waitcnt vmcnt(0)
5477 - s_waitcnt lgkmcnt(0)
5484 - Must happen before
5493 fence-paired-atomic).
5500 fence-paired-atomic.
5502 **Acquire-Release Atomic**
5503 ------------------------------------------------------------------------------------
5504 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
5507 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
5516 - Must happen before
5527 2. buffer/global_atomic
5529 atomicrmw acq_rel - workgroup - local 1. ds_atomic
5530 2. s_waitcnt lgkmcnt(0)
5533 - Must happen before
5542 older than the local load
5546 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
5555 - Must happen before
5567 3. s_waitcnt lgkmcnt(0)
5570 - Must happen before
5579 older than a local load
5583 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
5588 - Could be split into
5597 - s_waitcnt vmcnt(0)
5604 - s_waitcnt lgkmcnt(0)
5611 - Must happen before
5622 2. buffer/global_atomic
5623 3. s_waitcnt vmcnt(0)
5625 - Must happen before
5634 4. buffer_wbinvl1_vol
5636 - Must happen before
5646 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
5651 - Could be split into
5660 - s_waitcnt vmcnt(0)
5667 - s_waitcnt lgkmcnt(0)
5674 - Must happen before
5686 3. s_waitcnt vmcnt(0) &
5691 - Must happen before
5700 4. buffer_wbinvl1_vol
5702 - Must happen before
5712 fence acq_rel - singlethread *none* *none*
5714 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
5734 - Must happen before
5757 acquire-fence-paired-atomic)
5778 release-fence-paired-atomic).
5783 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
5790 - However, since LLVM
5798 - Could be split into
5807 - s_waitcnt vmcnt(0)
5814 - s_waitcnt lgkmcnt(0)
5821 - Must happen before
5826 global/local/generic
5835 acquire-fence-paired-atomic)
5847 global/local/generic
5856 release-fence-paired-atomic).
5861 2. buffer_wbinvl1_vol
5863 - Must happen before
5877 **Sequential Consistent Atomic**
5878 ------------------------------------------------------------------------------------
5879 load atomic seq_cst - singlethread - global *Same as corresponding
5880 - wavefront - local load atomic acquire,
5881 - generic except must generate
5882 all instructions even
5884 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
5900 lgkmcnt(0) and so do
5932 order. The s_waitcnt
5933 could be placed after
5937 make the s_waitcnt be
5944 instructions same as
5947 except must generate
5948 all instructions even
5950 load atomic seq_cst - workgroup - local *Same as corresponding
5951 load atomic acquire,
5952 except must generate
5953 all instructions even
5956 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
5957 - system - generic vmcnt(0)
5959 - Could be split into
5968 - s_waitcnt lgkmcnt(0)
5981 lgkmcnt(0) and so do
5984 - s_waitcnt vmcnt(0)
6029 order. The s_waitcnt
6030 could be placed after
6034 make the s_waitcnt be
6041 instructions same as
6044 except must generate
6045 all instructions even
6047 store atomic seq_cst - singlethread - global *Same as corresponding
6048 - wavefront - local store atomic release,
6049 - workgroup - generic except must generate
6050 - agent all instructions even
6051 - system for OpenCL.*
6052 atomicrmw seq_cst - singlethread - global *Same as corresponding
6053 - wavefront - local atomicrmw acq_rel,
6054 - workgroup - generic except must generate
6055 - agent all instructions even
6056 - system for OpenCL.*
6057 fence seq_cst - singlethread *none* *Same as corresponding
6058 - wavefront fence acq_rel,
6059 - workgroup except must generate
6060 - agent all instructions even
6061 - system for OpenCL.*
6062 ============ ============ ============== ========== ================================
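As a reading aid for the table, the ``load atomic acquire`` rows for GFX6-GFX9 can be expressed as a lookup function. This is a sketch of the HSA sequences only (the OpenCL omissions are not modelled); the function name is hypothetical, and the ``glc=1`` on the agent/system global row is taken by analogy with the generic row.

```python
def load_atomic_acquire_gfx6_gfx9(scope, address_space):
    """Return the ordered machine code sequence for a load atomic acquire."""
    # A scope wider than work-group is treated as work-group for LDS.
    if address_space == "local" and scope in ("agent", "system"):
        scope = "workgroup"
    if scope in ("singlethread", "wavefront"):
        return ["buffer/global/ds/flat_load"]
    if scope == "workgroup":
        if address_space == "global":
            return ["buffer/global_load"]
        return ["ds/flat_load", "s_waitcnt lgkmcnt(0)"]
    # Remaining scopes: agent and system.
    if address_space == "global":
        return ["buffer/global_load glc=1",
                "s_waitcnt vmcnt(0)",
                "buffer_wbinvl1_vol"]
    if address_space == "generic":
        return ["flat_load glc=1",
                "s_waitcnt vmcnt(0) & lgkmcnt(0)",
                "buffer_wbinvl1_vol"]
    raise ValueError(f"unsupported combination: {scope}, {address_space}")
```

The ordering of the returned list matters: the wait must complete before the invalidate, and the invalidate before any following global or generic access.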
6064 .. _amdgpu-amdhsa-memory-model-gfx90a:
6071 * Each agent has multiple shader arrays (SA).
6072 * Each SA has multiple compute units (CU).
6073 * Each CU has multiple SIMDs that execute wavefronts.
6074 * The wavefronts for a single work-group are executed in the same CU but may be
6075 executed by different SIMDs. The exception is tgsplit execution mode, in
6076 which the wavefronts may be executed by different SIMDs in different CUs.
6077 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
6078 executing on it. The exception is tgsplit execution mode, in which no LDS
6079 is allocated because wavefronts of the same work-group can be in different CUs.
6080 * All LDS operations of a CU are performed as wavefront wide operations in a
6081 global order and involve no caching. Completion is reported to a wavefront in execution order.
6083 * The LDS memory has multiple request queues shared by the SIMDs of a
6084 CU. Therefore, the LDS operations performed by different wavefronts of a
6085 work-group can be reordered relative to each other, which can result in
6086 reordering the visibility of vector memory operations with respect to LDS
6087 operations of other wavefronts in the same work-group. A ``s_waitcnt
6088 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
6089 vector memory operations between wavefronts of a work-group, but not between
6090 operations performed by the same wavefront.
6091 * The vector memory operations are performed as wavefront wide operations and
6092 completion is reported to a wavefront in execution order. The exception is
6093 that ``flat_load/store/atomic`` instructions can report out of vector memory
6094 order if they access LDS memory, and out of LDS operation order if they access global memory.
6096 * The vector memory operations access a single vector L1 cache shared by all
6097 SIMDs of a CU. Therefore:
6099 * No special action is required for coherence between the lanes of a single
6102 * No special action is required for coherence between wavefronts in the same
6103 work-group since they execute on the same CU. The exception is when in
6104 tgsplit execution mode as wavefronts of the same work-group can be in
6105 different CUs and so a ``buffer_wbinvl1_vol`` is required as described in
6108 * A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
6109 executing in different work-groups as they may be executing on different
6112 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
6113 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
6114 scalar operations are used in a restricted way so do not impact the memory
6115 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
6116 * The vector and scalar memory operations use an L2 cache shared by all CUs on
6119 * The L2 cache has independent channels to service disjoint ranges of virtual
6121 * Each CU has a separate request queue per channel. Therefore, the vector and
6122 scalar memory operations performed by wavefronts executing in different
6123 work-groups (which may be executing on different CUs), or the same
6124 work-group if executing in tgsplit mode, of an agent can be reordered
6125 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
6126 synchronization between vector memory operations of different CUs. It
6127 ensures a previous vector memory operation has completed before executing a
6128 subsequent vector memory or LDS operation and so can be used to meet the
6129 requirements of acquire and release.
6130 * The L2 cache of one agent can be kept coherent with other agents by:
6131 using the MTYPE RW (read-write) or MTYPE CC (cache-coherent) with the PTE
6132 C-bit for memory local to the L2; and using the MTYPE NC (non-coherent) with
6133 the PTE C-bit set or MTYPE UC (uncached) for memory not local to the L2.
6135 * Any local memory cache lines will be automatically invalidated by writes
6136 from CUs associated with other L2 caches, or writes from the CPU, due to
6137 the cache probe caused by coherent requests. Coherent requests are caused
6138 by GPU accesses to pages with the PTE C-bit set, by CPU accesses over
6139 XGMI, and by PCIe requests that are configured to be coherent requests.
6140 * XGMI accesses from the CPU to local memory may be cached on the CPU.
6141 Subsequent access from the GPU will automatically invalidate or writeback
6142 the CPU cache due to the L2 probe filter and the PTE C-bit being set.
6143 * Since all work-groups on the same agent share the same L2, no L2
6144 invalidation or writeback is required for coherence.
6145 * To ensure coherence of local and remote memory writes of work-groups in
6146 different agents a ``buffer_wbl2`` is required. It will writeback dirty L2
6147 cache lines of MTYPE RW (used for local coarse grain memory) and MTYPE NC
6148 (used for remote coarse grain memory). Note that MTYPE CC (used for local
6149 fine grain memory) causes write through to DRAM, and MTYPE UC (used for
6150 remote fine grain memory) bypasses the L2, so both will never result in
6151 dirty L2 cache lines.
6152 * To ensure coherence of local and remote memory reads of work-groups in
6153 different agents a ``buffer_invl2`` is required. It will invalidate L2
6154 cache lines with MTYPE NC (used for remote coarse grain memory). Note that
6155 MTYPE CC (used for local fine grain memory) and MTYPE RW (used for local
6156 coarse grain memory) cause local reads to be invalidated by remote writes with
6157 the PTE C-bit so these cache lines are not invalidated. Note that
6158 MTYPE UC (used for remote fine grain memory) bypasses the L2, so will
6159 never result in L2 cache lines that need to be invalidated.
6161 * PCIe access from the GPU to the CPU memory is kept coherent by using the
6162 MTYPE UC (uncached) which bypasses the L2.
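The MTYPE behaviour described in the bullets above can be tabulated. The dictionary and helper names below are hypothetical; the entries restate the text and are not derived from hardware documentation.

```python
# For each MTYPE: its typical use, whether buffer_wbl2 has dirty lines to
# write back, and whether buffer_invl2 invalidates its lines.
L2_MTYPE_BEHAVIOR = {
    "RW": {"use": "local coarse grain", "wbl2": True, "invl2": False},
    "NC": {"use": "remote coarse grain", "wbl2": True, "invl2": True},
    "CC": {"use": "local fine grain", "wbl2": False, "invl2": False},
    "UC": {"use": "remote fine grain", "wbl2": False, "invl2": False},
}


def written_back_by_wbl2(mtype):
    """True if lines of this MTYPE can be dirty in L2 and so are written back."""
    return L2_MTYPE_BEHAVIOR[mtype]["wbl2"]


def invalidated_by_invl2(mtype):
    """True if buffer_invl2 invalidates L2 lines of this MTYPE."""
    return L2_MTYPE_BEHAVIOR[mtype]["invl2"]
```

CC is write-through and UC bypasses the L2, so neither can leave dirty lines; RW and CC lines are kept fresh by probes from remote writes, which is why only NC needs an explicit invalidate.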
6164 Scalar memory operations are only used to access memory that is proven to not
6165 change during the execution of the kernel dispatch. This includes constant
6166 address space and global address space for program scope ``const`` variables.
6167 Therefore, the kernel machine code does not have to maintain the scalar cache to
6168 ensure it is coherent with the vector caches. The scalar and vector caches are
6169 invalidated between kernel dispatches by CP since constant address space data
6170 may change between kernel dispatch executions. See
6171 :ref:`amdgpu-amdhsa-memory-spaces`.
6173 The one exception is if scalar writes are used to spill SGPR registers. In this
6174 case the AMDGPU backend ensures the memory location used to spill is never
6175 accessed by vector memory operations at the same time. If scalar writes are used
6176 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
6177 return since the locations may be used for vector memory instructions by a
6178 future wavefront that uses the same scratch area, or a function call that
6179 creates a frame at the same address, respectively. There is no need for a
6180 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
6182 For kernarg backing memory:
6184 * CP invalidates the L1 cache at the start of each kernel dispatch.
6185 * On dGPU over XGMI or PCIe the kernarg backing memory is allocated in host
6186 memory accessed as MTYPE UC (uncached) to avoid needing to invalidate the L2
6187 cache. This also causes it to be treated as non-volatile and so is not
6188 invalidated by ``*_vol``.
6189 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
6190 so the L2 cache will be coherent with the CPU and other agents.
6192 Scratch backing memory (which is used for the private address space) is accessed
6193 with MTYPE NC_NV (non-coherent non-volatile). Since the private address space is
6194 only accessed by a single thread, and is always write-before-read, there is
6195 never a need to invalidate these entries from the L1 cache. Hence all cache
6196 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
6198 The code sequences used to implement the memory model for GFX90A are defined
6199 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table`.
6201 .. table:: AMDHSA Memory Model Code Sequences GFX90A
6202 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx90a-table
6204 ============ ============ ============== ========== ================================
6205 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
6206 Ordering Sync Scope Address GFX90A
6208 ============ ============ ============== ========== ================================
6210 ------------------------------------------------------------------------------------
6211 load *none* *none* - global - !volatile & !nontemporal
6213 - private 1. buffer/global/flat_load
6215 - !volatile & nontemporal
6217 1. buffer/global/flat_load
6222 1. buffer/global/flat_load
6224 2. s_waitcnt vmcnt(0)
6226 - Must happen before
6227 any following volatile
6238 load *none* *none* - local 1. ds_load
6239 store *none* *none* - global - !volatile & !nontemporal
6241 - private 1. buffer/global/flat_store
6243 - !volatile & nontemporal
6245 1. buffer/global/flat_store
6250 1. buffer/global/flat_store
6251 2. s_waitcnt vmcnt(0)
6253 - Must happen before
6254 any following volatile
6265 store *none* *none* - local 1. ds_store
6266 **Unordered Atomic**
6267 ------------------------------------------------------------------------------------
6268 load atomic unordered *any* *any* *Same as non-atomic*.
6269 store atomic unordered *any* *any* *Same as non-atomic*.
6270 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
6271 **Monotonic Atomic**
6272 ------------------------------------------------------------------------------------
6273 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
6274 - wavefront - generic
6275 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
6278 - If not TgSplit execution
6281 load atomic monotonic - singlethread - local *If TgSplit execution mode,
6282 - wavefront local address space cannot
6283 - workgroup be used.*
6286 load atomic monotonic - agent - global 1. buffer/global/flat_load
6288 load atomic monotonic - system - global 1. buffer/global/flat_load
6290 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
6291 - wavefront - generic
6294 store atomic monotonic - system - global 1. buffer/global/flat_store
6296 store atomic monotonic - singlethread - local *If TgSplit execution mode,
6297 - wavefront local address space cannot
6298 - workgroup be used.*
6301 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
6302 - wavefront - generic
6305 atomicrmw monotonic - system - global 1. buffer/global/flat_atomic
6307 atomicrmw monotonic - singlethread - local *If TgSplit execution mode,
6308 - wavefront local address space cannot
6309 - workgroup be used.*
6313 ------------------------------------------------------------------------------------
6314 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
6317 load atomic acquire - workgroup - global 1. buffer/global_load glc=1
6319 - If not TgSplit execution
6322 2. s_waitcnt vmcnt(0)
6324 - If not TgSplit execution
6326 - Must happen before the
6327 following buffer_wbinvl1_vol.
6329 3. buffer_wbinvl1_vol
6331 - If not TgSplit execution
6333 - Must happen before
6344 load atomic acquire - workgroup - local *If TgSplit execution mode,
6345 local address space cannot
6349 2. s_waitcnt lgkmcnt(0)
6352 - Must happen before
6361 older than the local load
6365 load atomic acquire - workgroup - generic 1. flat_load glc=1
6367 - If not TgSplit execution
6370 2. s_waitcnt lgkm/vmcnt(0)
6372 - Use lgkmcnt(0) if not
6373 TgSplit execution mode
6374 and vmcnt(0) if TgSplit
6376 - If OpenCL, omit lgkmcnt(0).
6377 - Must happen before
6379 buffer_wbinvl1_vol and any
6380 following global/generic
6387 older than a local load
6391 3. buffer_wbinvl1_vol
6393 - If not TgSplit execution
6400 load atomic acquire - agent - global 1. buffer/global_load
6402 2. s_waitcnt vmcnt(0)
6404 - Must happen before
6412 3. buffer_wbinvl1_vol
6414 - Must happen before
6424 load atomic acquire - system - global 1. buffer/global/flat_load
6426 2. s_waitcnt vmcnt(0)
6428 - Must happen before
6429 following buffer_invl2 and
6439 - Must happen before
6447 stale L1 global data,
6448 nor see stale L2 MTYPE
6450 MTYPE RW and CC memory will
6451 never be stale in L2 due to
6454 load atomic acquire - agent - generic 1. flat_load glc=1
6455 2. s_waitcnt vmcnt(0) &
6458 - If TgSplit execution mode,
6462 - Must happen before
6465 - Ensures the flat_load
6470 3. buffer_wbinvl1_vol
6472 - Must happen before
6482 load atomic acquire - system - generic 1. flat_load glc=1
6483 2. s_waitcnt vmcnt(0) &
6486 - If TgSplit execution mode,
6490 - Must happen before
6494 - Ensures the flat_load
6502 - Must happen before
6510 stale L1 global data,
6511 nor see stale L2 MTYPE
6513 MTYPE RW and CC memory will
6514 never be stale in L2 due to
6517 atomicrmw acquire - singlethread - global 1. buffer/global/flat_atomic
6518 - wavefront - generic
6519 atomicrmw acquire - singlethread - local *If TgSplit execution mode,
6520 - wavefront local address space cannot
6524 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
6525 2. s_waitcnt vmcnt(0)
6527 - If not TgSplit execution
6529 - Must happen before the
6530 following buffer_wbinvl1_vol.
6531 - Ensures the atomicrmw
6536 3. buffer_wbinvl1_vol
6538 - If not TgSplit execution
6540 - Must happen before
6550 atomicrmw acquire - workgroup - local *If TgSplit execution mode,
6551 local address space cannot
6555 2. s_waitcnt lgkmcnt(0)
6558 - Must happen before
6567 older than the local
6571 atomicrmw acquire - workgroup - generic 1. flat_atomic
6572 2. s_waitcnt lgkm/vmcnt(0)
6574 - Use lgkmcnt(0) if not
6575 TgSplit execution mode
6576 and vmcnt(0) if TgSplit
6578 - If OpenCL, omit lgkmcnt(0).
6579 - Must happen before
6581 buffer_wbinvl1_vol and
6594 3. buffer_wbinvl1_vol
6596 - If not TgSplit execution
6603 atomicrmw acquire - agent - global 1. buffer/global_atomic
6604 2. s_waitcnt vmcnt(0)
6606 - Must happen before
6615 3. buffer_wbinvl1_vol
6617 - Must happen before
6627 atomicrmw acquire - system - global 1. buffer/global_atomic
6628 2. s_waitcnt vmcnt(0)
6630 - Must happen before
6631 following buffer_invl2 and
6642 - Must happen before
6650 stale L1 global data,
6651 nor see stale L2 MTYPE
6653 MTYPE RW and CC memory will
6654 never be stale in L2 due to
6657 atomicrmw acquire - agent - generic 1. flat_atomic
6658 2. s_waitcnt vmcnt(0) &
6661 - If TgSplit execution mode,
6665 - Must happen before
6674 3. buffer_wbinvl1_vol
6676 - Must happen before
6686 atomicrmw acquire - system - generic 1. flat_atomic
6687 2. s_waitcnt vmcnt(0) &
6690 - If TgSplit execution mode,
6694 - Must happen before
6707 - Must happen before
6715 stale L1 global data,
6716 nor see stale L2 MTYPE
6718 MTYPE RW and CC memory will
6719 never be stale in L2 due to
6722 fence acquire - singlethread *none* *none*
6724 fence acquire - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
6726 - Use lgkmcnt(0) if not
6727 TgSplit execution mode
6728 and vmcnt(0) if TgSplit
6738 - However, since LLVM
6753 - s_waitcnt vmcnt(0)
6765 fence-paired-atomic).
6766 - s_waitcnt lgkmcnt(0)
6777 fence-paired-atomic).
6778 - Must happen before
6780 buffer_wbinvl1_vol and
6791 fence-paired-atomic.
6793 2. buffer_wbinvl1_vol
6795 - If not TgSplit execution
6802 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
6805 - If TgSplit execution mode,
6811 - However, since LLVM
6819 - Could be split into
6828 - s_waitcnt vmcnt(0)
6839 fence-paired-atomic).
6840 - s_waitcnt lgkmcnt(0)
6851 fence-paired-atomic).
6852 - Must happen before
6866 fence-paired-atomic.
6868 2. buffer_wbinvl1_vol
6870 - Must happen before any
6871 following global/generic
6880 fence acquire - system *none* 1. s_waitcnt lgkmcnt(0) &
6883 - If TgSplit execution mode,
6889 - However, since LLVM
6897 - Could be split into
6906 - s_waitcnt vmcnt(0)
6917 fence-paired-atomic).
6918 - s_waitcnt lgkmcnt(0)
6929 fence-paired-atomic).
6930 - Must happen before
6931 the following buffer_invl2 and
6944 fence-paired-atomic.
6949 - Must happen before any
6950 following global/generic
6957 stale L1 global data,
6958 nor see stale L2 MTYPE
6960 MTYPE RW and CC memory will
6961 never be stale in L2 due to
6964 ------------------------------------------------------------------------------------
6965 store atomic release - singlethread - global 1. buffer/global/flat_store
6966 - wavefront - generic
6967 store atomic release - singlethread - local *If TgSplit execution mode,
6968 - wavefront local address space cannot
6972 store atomic release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
6974 - Use lgkmcnt(0) if not
6975 TgSplit execution mode
6976 and vmcnt(0) if TgSplit
6978 - If OpenCL, omit lgkmcnt(0).
6979 - s_waitcnt vmcnt(0)
6982 global/generic load/store/
6983 load atomic/store atomic/
6985 - s_waitcnt lgkmcnt(0)
6992 - Must happen before
7003 2. buffer/global/flat_store
7004 store atomic release - workgroup - local *If TgSplit execution mode,
7005 local address space cannot
7009 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
7012 - If TgSplit execution mode,
7018 - Could be split into
7027 - s_waitcnt vmcnt(0)
7034 - s_waitcnt lgkmcnt(0)
7041 - Must happen before
7052 2. buffer/global/flat_store
7053 store atomic release - system - global 1. buffer_wbl2
7055 - Must happen before
7056 following s_waitcnt.
7057 - Performs L2 writeback to
7061 visible at system scope.
7063 2. s_waitcnt lgkmcnt(0) &
7066 - If TgSplit execution mode,
7072 - Could be split into
7081 - s_waitcnt vmcnt(0)
7082 must happen after any
7088 - s_waitcnt lgkmcnt(0)
7089 must happen after any
7095 - Must happen before
7100 to memory and the L2
7107 3. buffer/global/flat_store
7108 atomicrmw release - singlethread - global 1. buffer/global/flat_atomic
7109 - wavefront - generic
7110 atomicrmw release - singlethread - local *If TgSplit execution mode,
7111 - wavefront local address space cannot
7115 atomicrmw release - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
7117 - Use lgkmcnt(0) if not
7118 TgSplit execution mode
7119 and vmcnt(0) if TgSplit
7123 - s_waitcnt vmcnt(0)
7126 global/generic load/store/
7127 load atomic/store atomic/
7129 - s_waitcnt lgkmcnt(0)
7136 - Must happen before
7147 2. buffer/global/flat_atomic
7148 atomicrmw release - workgroup - local *If TgSplit execution mode,
7149 local address space cannot
7153 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
7156 - If TgSplit execution mode,
7160 - Could be split into
7169 - s_waitcnt vmcnt(0)
7176 - s_waitcnt lgkmcnt(0)
7183 - Must happen before
7194 2. buffer/global/flat_atomic
7195 atomicrmw release - system - global 1. buffer_wbl2
7197 - Must happen before
7198 following s_waitcnt.
7199 - Performs L2 writeback to
7203 visible at system scope.
7205 2. s_waitcnt lgkmcnt(0) &
7208 - If TgSplit execution mode,
7212 - Could be split into
7221 - s_waitcnt vmcnt(0)
7228 - s_waitcnt lgkmcnt(0)
7235 - Must happen before
7240 to memory and the L2
7247 3. buffer/global/flat_atomic
7248 fence release - singlethread *none* *none*
7250 fence release - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
7252 - Use lgkmcnt(0) if not
7253 TgSplit execution mode
7254 and vmcnt(0) if TgSplit
7264 - However, since LLVM
7279 - s_waitcnt vmcnt(0)
7284 load atomic/store atomic/
7286 - s_waitcnt lgkmcnt(0)
7293 - Must happen before
7302 fence-paired-atomic).
7309 fence-paired-atomic.
7311 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
7314 - If TgSplit execution mode,
7324 - However, since LLVM
7339 - Could be split into
7348 - s_waitcnt vmcnt(0)
7355 - s_waitcnt lgkmcnt(0)
7362 - Must happen before
7371 fence-paired-atomic).
7378 fence-paired-atomic.
7380 fence release - system *none* 1. buffer_wbl2
7385 - Must happen before
7386 following s_waitcnt.
7387 - Performs L2 writeback to
7391 visible at system scope.
7393 2. s_waitcnt lgkmcnt(0) &
7396 - If TgSplit execution mode,
7406 - However, since LLVM
7421 - Could be split into
7430 - s_waitcnt vmcnt(0)
7437 - s_waitcnt lgkmcnt(0)
7444 - Must happen before
7453 fence-paired-atomic).
7460 fence-paired-atomic.
7462 **Acquire-Release Atomic**
7463 ------------------------------------------------------------------------------------
7464 atomicrmw acq_rel - singlethread - global 1. buffer/global/flat_atomic
7465 - wavefront - generic
7466 atomicrmw acq_rel - singlethread - local *If TgSplit execution mode,
7467 - wavefront local address space cannot
7471 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
7473 - Use lgkmcnt(0) if not
7474 TgSplit execution mode
7475 and vmcnt(0) if TgSplit
7485 - s_waitcnt vmcnt(0)
7488 global/generic load/store/
7489 load atomic/store atomic/
7491 - s_waitcnt lgkmcnt(0)
7498 - Must happen before
7509 2. buffer/global_atomic
7510 3. s_waitcnt vmcnt(0)
7512 - If not TgSplit execution
7514 - Must happen before
7524 4. buffer_wbinvl1_vol
7526 - If not TgSplit execution
7533 atomicrmw acq_rel - workgroup - local *If TgSplit execution mode,
7534 local address space cannot
7538 2. s_waitcnt lgkmcnt(0)
7541 - Must happen before
7550 older than the local load
7554 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkm/vmcnt(0)
7556 - Use lgkmcnt(0) if not
7557 TgSplit execution mode
7558 and vmcnt(0) if TgSplit
7562 - s_waitcnt vmcnt(0)
7565 global/generic load/store/
7566 load atomic/store atomic/
7568 - s_waitcnt lgkmcnt(0)
7575 - Must happen before
7587 3. s_waitcnt lgkmcnt(0) &
7590 - If not TgSplit execution
7591 mode, omit vmcnt(0).
7594 - Must happen before
7596 buffer_wbinvl1_vol and
7605 older than a local load
7609 3. buffer_wbinvl1_vol
7611 - If not TgSplit execution
7618 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
7621 - If TgSplit execution mode,
7625 - Could be split into
7634 - s_waitcnt vmcnt(0)
7641 - s_waitcnt lgkmcnt(0)
7648 - Must happen before
7659 2. buffer/global_atomic
7660 3. s_waitcnt vmcnt(0)
7662 - Must happen before
7671 4. buffer_wbinvl1_vol
7673 - Must happen before
7683 atomicrmw acq_rel - system - global 1. buffer_wbl2
7685 - Must happen before
7686 following s_waitcnt.
7687 - Performs L2 writeback to
7691 visible at system scope.
7693 2. s_waitcnt lgkmcnt(0) &
7696 - If TgSplit execution mode,
7700 - Could be split into
7709 - s_waitcnt vmcnt(0)
7716 - s_waitcnt lgkmcnt(0)
7723 - Must happen before
7728 to global and L2 writeback
7729 have completed before
7734 3. buffer/global_atomic
7735 4. s_waitcnt vmcnt(0)
7737 - Must happen before
7738 following buffer_invl2 and
7749 - Must happen before
7757 stale L1 global data,
7758 nor see stale L2 MTYPE
7760 MTYPE RW and CC memory will
7761 never be stale in L2 due to
7764 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
7767 - If TgSplit execution mode,
7771 - Could be split into
7780 - s_waitcnt vmcnt(0)
7787 - s_waitcnt lgkmcnt(0)
7794 - Must happen before
7806 3. s_waitcnt vmcnt(0) &
7809 - If TgSplit execution mode,
7813 - Must happen before
7822 4. buffer_wbinvl1_vol
7824 - Must happen before
7834 atomicrmw acq_rel - system - generic 1. buffer_wbl2
7836 - Must happen before
7837 following s_waitcnt.
7838 - Performs L2 writeback to
7842 visible at system scope.
7844 2. s_waitcnt lgkmcnt(0) &
7847 - If TgSplit execution mode,
7851 - Could be split into
7860 - s_waitcnt vmcnt(0)
7867 - s_waitcnt lgkmcnt(0)
7874 - Must happen before
7879 to global and L2 writeback
7880 have completed before
7886 4. s_waitcnt vmcnt(0) &
7889 - If TgSplit execution mode,
7893 - Must happen before
7894 following buffer_invl2 and
7905 - Must happen before
7913 stale L1 global data,
7914 nor see stale L2 MTYPE
7916 MTYPE RW and CC memory will
7917 never be stale in L2 due to
7920 fence acq_rel - singlethread *none* *none*
7922 fence acq_rel - workgroup *none* 1. s_waitcnt lgkm/vmcnt(0)
7924 - Use lgkmcnt(0) if not
7925 TgSplit execution mode
7926 and vmcnt(0) if TgSplit
7945 - s_waitcnt vmcnt(0)
7950 load atomic/store atomic/
7952 - s_waitcnt lgkmcnt(0)
7959 - Must happen before
7982 acquire-fence-paired-atomic)
8003 release-fence-paired-atomic).
8007 - Must happen before
8011 acquire-fence-paired
8012 atomic has completed
8021 acquire-fence-paired-atomic.
8023 2. buffer_wbinvl1_vol
8025 - If not TgSplit execution
8032 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
8035 - If TgSplit execution mode,
8041 - However, since LLVM
8049 - Could be split into
8058 - s_waitcnt vmcnt(0)
8065 - s_waitcnt lgkmcnt(0)
8072 - Must happen before
8077 global/local/generic
8086 acquire-fence-paired-atomic)
8098 global/local/generic
8107 release-fence-paired-atomic).
8112 2. buffer_wbinvl1_vol
8114 - Must happen before
8128 fence acq_rel - system *none* 1. buffer_wbl2
8133 - Must happen before
8134 following s_waitcnt.
8135 - Performs L2 writeback to
8139 visible at system scope.
8141 2. s_waitcnt lgkmcnt(0) &
8144 - If TgSplit execution mode,
8150 - However, since LLVM
8158 - Could be split into
8167 - s_waitcnt vmcnt(0)
8174 - s_waitcnt lgkmcnt(0)
8181 - Must happen before
8182 the following buffer_invl2 and
8186 global/local/generic
8195 acquire-fence-paired-atomic)
8207 global/local/generic
8216 release-fence-paired-atomic).
8224 - Must happen before
8233 stale L1 global data,
8234 nor see stale L2 MTYPE
8236 MTYPE RW and CC memory will
8237 never be stale in L2 due to
8240 **Sequential Consistent Atomic**
8241 ------------------------------------------------------------------------------------
8242 load atomic seq_cst - singlethread - global *Same as corresponding
8243 - wavefront - local load atomic acquire,
8244 - generic except must generate
8245 all instructions even
8247 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkm/vmcnt(0)
8249 - Use lgkmcnt(0) if not
8250 TgSplit execution mode
8251 and vmcnt(0) if TgSplit
8253 - s_waitcnt lgkmcnt(0) must
8266 lgkmcnt(0) and so do
8269 - s_waitcnt vmcnt(0)
8288 consistent global/local
8314 order. The s_waitcnt
8315 could be placed after
8319 make the s_waitcnt be
8326 instructions same as
8329 except must generate
8330 all instructions even
8332 load atomic seq_cst - workgroup - local *If TgSplit execution mode,
8333 local address space cannot
8336 *Same as corresponding
8337 load atomic acquire,
8338 except must generate
8339 all instructions even
8342 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
8343 - system - generic vmcnt(0)
8345 - If TgSplit execution mode,
8347 - Could be split into
8356 - s_waitcnt lgkmcnt(0)
8369 lgkmcnt(0) and so do
8372 - s_waitcnt vmcnt(0)
8417 order. The s_waitcnt
8418 could be placed after
8422 make the s_waitcnt be
8429 instructions same as
8432 except must generate
8433 all instructions even
8435 store atomic seq_cst - singlethread - global *Same as corresponding
8436 - wavefront - local store atomic release,
8437 - workgroup - generic except must generate
8438 - agent all instructions even
8439 - system for OpenCL.*
8440 atomicrmw seq_cst - singlethread - global *Same as corresponding
8441 - wavefront - local atomicrmw acq_rel,
8442 - workgroup - generic except must generate
8443 - agent all instructions even
8444 - system for OpenCL.*
8445 fence seq_cst - singlethread *none* *Same as corresponding
8446 - wavefront fence acq_rel,
8447 - workgroup except must generate
8448 - agent all instructions even
8449 - system for OpenCL.*
8450 ============ ============ ============== ========== ================================
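
As an illustration of how rows of the table above compose, the following is a
minimal Python sketch of the ``store atomic release`` rows for the global
address space. The function name ``gfx90a_store_release_global`` and the
instruction strings are hypothetical and exist only for this example. The
sketch ignores the OpenCL-specific omissions, writes ``global_store`` where the
table allows ``buffer/global/flat_store``, and spells ``s_waitcnt`` in the
table's notation rather than exact assembler syntax.

.. code-block:: python

  def gfx90a_store_release_global(scope, tgsplit=False):
      """Return the instruction sequence as a list of strings (sketch only)."""
      if scope in ("singlethread", "wavefront"):
          # No synchronization is needed within a single wavefront.
          return ["global_store"]
      if scope not in ("workgroup", "agent", "system"):
          raise ValueError("unknown sync scope: " + scope)

      seq = []
      if scope == "system":
          # The L2 writeback must be issued before the following s_waitcnt.
          seq.append("buffer_wbl2")

      if scope == "workgroup":
          # lgkmcnt(0) if not TgSplit execution mode, vmcnt(0) if TgSplit.
          counters = ["vmcnt(0)"] if tgsplit else ["lgkmcnt(0)"]
      else:
          # Agent and system scope: omit lgkmcnt(0) in TgSplit execution mode.
          counters = ["vmcnt(0)"] if tgsplit else ["lgkmcnt(0)", "vmcnt(0)"]
      seq.append("s_waitcnt " + " & ".join(counters))

      # The releasing store itself comes last.
      seq.append("global_store")
      return seq

Under these assumptions, ``gfx90a_store_release_global("system")`` yields the
writeback, the wait and then the store, in that order.
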
8452 .. _amdgpu-amdhsa-memory-model-gfx10:
8459 * Each agent has multiple shader arrays (SA).
8460 * Each SA has multiple work-group processors (WGP).
8461 * Each WGP has multiple compute units (CU).
8462 * Each CU has multiple SIMDs that execute wavefronts.
8463 * The wavefronts for a single work-group are executed in the same
8464 WGP. In CU wavefront execution mode the wavefronts may be executed by
8465 different SIMDs in the same CU. In WGP wavefront execution mode the
8466 wavefronts may be executed by different SIMDs in different CUs in the same
8468 * Each WGP has a single LDS memory shared by the wavefronts of the work-groups
8470 * All LDS operations of a WGP are performed as wavefront wide operations in a
8471 global order and involve no caching. Completion is reported to a wavefront in
8473 * The LDS memory has multiple request queues shared by the SIMDs of a
8474 WGP. Therefore, the LDS operations performed by different wavefronts of a
8475 work-group can be reordered relative to each other, which can result in
8476 reordering the visibility of vector memory operations with respect to LDS
8477 operations of other wavefronts in the same work-group. An ``s_waitcnt
8478 lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
8479 vector memory operations between wavefronts of a work-group, but not between
8480 operations performed by the same wavefront.
8481 * The vector memory operations are performed as wavefront wide operations.
8482 Completion of load/store/sample operations is reported to a wavefront in
8483 execution order of other load/store/sample operations performed by that
8485 * The vector memory operations access a vector L0 cache. There is a single L0
8486 cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
8487 special action is required for coherence between the lanes of a single
8488 wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
8489 wavefronts executing in the same work-group as they may be executing on SIMDs
8490 of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
8491 required for coherence between wavefronts executing in different work-groups
8492 as they may be executing on different WGPs.
8493 * The scalar memory operations access a scalar L0 cache shared by all wavefronts
8494 on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
8495 operations are used in a restricted way so do not impact the memory model. See
8496 :ref:`amdgpu-amdhsa-memory-spaces`.
8497 * The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
8498 the same SA. Therefore, no special action is required for coherence between
8499 the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
8500 required for coherence between wavefronts executing in different work-groups
8501 as they may be executing on different SAs that access different L1s.
8502 * The L1 caches have independent quadrants to service disjoint ranges of virtual
8504 * Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
8505 vector and scalar memory operations performed by different wavefronts, whether
8506 executing in the same or different work-groups (which may be executing on
8507 different CUs accessing different L0s), can be reordered relative to each
8508 other. An ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
8509 synchronization between vector memory operations of different wavefronts. It
8510 ensures a previous vector memory operation has completed before executing a
8511 subsequent vector memory or LDS operation and so can be used to meet the
8512 requirements of acquire, release and sequential consistency.
8513 * The L1 caches use an L2 cache shared by all SAs on the same agent.
8514 * The L2 cache has independent channels to service disjoint ranges of virtual
8516 * Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
8517 quadrant has a separate request queue per L2 channel. Therefore, the vector
8518 and scalar memory operations performed by wavefronts executing in different
8519 work-groups (which may be executing on different SAs) of an agent can be
8520 reordered relative to each other. An ``s_waitcnt vmcnt(0) & vscnt(0)`` is
8521 required to ensure synchronization between vector memory operations of
8522 different SAs. It ensures a previous vector memory operation has completed
8523 before executing a subsequent vector memory operation and so can be used to meet the
8524 requirements of acquire, release and sequential consistency.
8525 * The L2 cache can be kept coherent with other agents on some targets, or ranges
8526 of virtual addresses can be set up to bypass it to ensure system coherence.
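
The vector cache hierarchy described in the list above can be summarized with a
small Python sketch. It is only a model for reasoning about which cache level
two wavefronts of the same agent have in common. The ``Location`` tuple and the
function ``first_shared_vector_cache`` are hypothetical names introduced for
this example, and the scalar L0 cache is deliberately left out.

.. code-block:: python

  from collections import namedtuple

  # Hypothetical coordinates of the CU on which a wavefront executes:
  # shader array (SA), work-group processor (WGP) within the SA, CU within the WGP.
  Location = namedtuple("Location", ["sa", "wgp", "cu"])

  def first_shared_vector_cache(a, b):
      """Return the closest vector cache level shared by two wavefronts."""
      if a == b:
          # Same CU: every SIMD of a CU accesses the same vector L0 cache.
          return "L0"
      if a.sa == b.sa:
          # Same SA: the L0 caches use an L1 cache shared by all WGPs of the SA.
          return "L1"
      # Different SAs of the same agent only share the L2 cache.
      return "L2"

Two wavefronts of one work-group running in WGP wavefront execution mode may
sit on different CUs, so the model reports ``"L1"`` for them, which is why a
``buffer_gl0_inv`` is needed in that case.
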
8528 Scalar memory operations are only used to access memory that is proven to not
8529 change during the execution of the kernel dispatch. This includes constant
8530 address space and global address space for program scope ``const`` variables.
8531 Therefore, the kernel machine code does not have to maintain the scalar cache to
8532 ensure it is coherent with the vector caches. The scalar and vector caches are
8533 invalidated between kernel dispatches by CP since constant address space data
8534 may change between kernel dispatch executions. See
8535 :ref:`amdgpu-amdhsa-memory-spaces`.
8537 The one exception is if scalar writes are used to spill SGPR registers. In this
8538 case the AMDGPU backend ensures the memory location used to spill is never
8539 accessed by vector memory operations at the same time. If scalar writes are used
8540 then an ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
8541 return since the locations may be used for vector memory instructions by a
8542 future wavefront that uses the same scratch area, or a function call that
8543 creates a frame at the same address, respectively. There is no need for an
8544 ``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
8546 For kernarg backing memory:
8548 * CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
8549 * On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
8550 needing to invalidate the L2 cache.
8551 * On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
8552 so the L2 cache will be coherent with the CPU and other agents.
8554 Scratch backing memory (which is used for the private address space) is accessed
8555 with MTYPE NC (non-coherent). Since the private address space is only accessed
8556 by a single thread, and is always write-before-read, there is never a need to
8557 invalidate these entries from the L0 or L1 caches.
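
The MTYPE choices stated above for kernarg and scratch backing memory can be
written down as a small lookup. This is a sketch in Python, and the function
name ``backing_memory_mtype`` and its arguments are hypothetical, chosen only
for illustration.

.. code-block:: python

  def backing_memory_mtype(kind, is_apu=False):
      """Return the MTYPE used for the given kind of backing memory."""
      if kind == "scratch":
          # Private address space: single thread, always write-before-read.
          return "NC"
      if kind == "kernarg":
          # APU: cache coherent with the CPU. dGPU: uncached, so the L2
          # cache never needs to be invalidated for kernarg accesses.
          return "CC" if is_apu else "UC"
      raise ValueError("unknown backing memory kind: " + kind)
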
8559 Wavefronts are executed in native mode with in-order reporting of loads and
8560 sample instructions. In this mode vmcnt reports completion of load, atomic with
8561 return and sample instructions in order, and the vscnt reports the completion of
8562 store and atomic without return in order. See the ``MEM_ORDERED`` field in
8563 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
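
The split between the two counters can be sketched as follows. The Python
helper names ``completion_counter`` and ``waitcnt_for`` and the operation
labels are assumptions made for this example, and the resulting string uses the
``s_waitcnt`` notation of this document rather than exact assembler syntax.

.. code-block:: python

  # Operation kinds whose completion is reported by each counter.
  VMCNT_OPS = {"load", "atomic_with_return", "sample"}
  VSCNT_OPS = {"store", "atomic_without_return"}

  def completion_counter(op):
      """Return the counter that reports completion of a vector memory op."""
      if op in VMCNT_OPS:
          return "vmcnt"
      if op in VSCNT_OPS:
          return "vscnt"
      raise ValueError("unknown vector memory operation: " + op)

  def waitcnt_for(ops):
      """Build the wait needed for all given operations to have completed."""
      counters = sorted({completion_counter(op) for op in ops})
      if not counters:
          return None  # Nothing outstanding, so no wait is required.
      return "s_waitcnt " + " & ".join(c + "(0)" for c in counters)

For example, waiting for a preceding load and store needs both counters, while
waiting for an atomic without return only needs ``vscnt(0)``.
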
8565 Wavefronts can be executed in WGP or CU wavefront execution mode:
8567 * In WGP wavefront execution mode the wavefronts of a work-group are executed
8568 on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
8569 CU L0 caches is required for work-group synchronization. Also, accesses to L1
8570 at work-group scope need to be explicitly ordered as the accesses from
8571 different CUs are not ordered.
8572 * In CU wavefront execution mode the wavefronts of a work-group are executed on
8573 the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
8574 the work-group access the same L0 which in turn ensures L1 accesses are
8575 ordered and so do not require explicit management of the caches for
8576 work-group synchronization.
8578 See the ``WGP_MODE`` field in
8579 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and
8580 :ref:`amdgpu-target-features`.
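
The effect of the wavefront execution mode on cache management can be
illustrated with a short Python sketch that lists the cache invalidates an
acquire needs at each synchronization scope. The function name
``gfx10_acquire_invalidates`` is hypothetical, and the sketch covers only the
invalidates, not the accompanying ``s_waitcnt`` instructions or cache policy
bits.

.. code-block:: python

  def gfx10_acquire_invalidates(scope, wgp_mode):
      """Return the cache invalidates an acquire needs at the given scope."""
      if scope in ("singlethread", "wavefront"):
          # Lanes of one wavefront always see the same L0 cache.
          return []
      if scope == "workgroup":
          # In WGP mode the work-group may span both CUs of the WGP, each
          # with its own L0. In CU mode all wavefronts share one L0.
          return ["buffer_gl0_inv"] if wgp_mode else []
      if scope in ("agent", "system"):
          # Other work-groups may run on different WGPs and different SAs.
          return ["buffer_gl0_inv", "buffer_gl1_inv"]
      raise ValueError("unknown sync scope: " + scope)

In CU wavefront execution mode the work-group case returns an empty list, which
matches the statement that no explicit cache management is required for
work-group synchronization in that mode.
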
8582 The code sequences used to implement the memory model for GFX10 are defined in
8583 table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-table`.
8585 .. table:: AMDHSA Memory Model Code Sequences GFX10
8586 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx10-table
8588 ============ ============ ============== ========== ================================
8589 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
8590 Ordering Sync Scope Address GFX10
8592 ============ ============ ============== ========== ================================
8594 ------------------------------------------------------------------------------------
8595 load *none* *none* - global - !volatile & !nontemporal
8597 - private 1. buffer/global/flat_load
8599 - !volatile & nontemporal
8601 1. buffer/global/flat_load
8606 1. buffer/global/flat_load
8608 2. s_waitcnt vmcnt(0)
8610 - Must happen before
8611 any following volatile
8622 load *none* *none* - local 1. ds_load
8623 store *none* *none* - global - !volatile & !nontemporal
8625 - private 1. buffer/global/flat_store
8627 - !volatile & nontemporal
8629 1. buffer/global/flat_store
8634 1. buffer/global/flat_store
8635 2. s_waitcnt vscnt(0)
8637 - Must happen before
8638 any following volatile
8649 store *none* *none* - local 1. ds_store
8650 **Unordered Atomic**
8651 ------------------------------------------------------------------------------------
8652 load atomic unordered *any* *any* *Same as non-atomic*.
8653 store atomic unordered *any* *any* *Same as non-atomic*.
8654 atomicrmw unordered *any* *any* *Same as monotonic atomic*.
8655 **Monotonic Atomic**
8656 ------------------------------------------------------------------------------------
8657 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
8658 - wavefront - generic
8659 load atomic monotonic - workgroup - global 1. buffer/global/flat_load
8662 - If CU wavefront execution
8665 load atomic monotonic - singlethread - local 1. ds_load
8668 load atomic monotonic - agent - global 1. buffer/global/flat_load
8669 - system - generic glc=1 dlc=1
8670 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
8671 - wavefront - generic
8675 store atomic monotonic - singlethread - local 1. ds_store
8678 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
8679 - wavefront - generic
8683 atomicrmw monotonic - singlethread - local 1. ds_atomic
8687 ------------------------------------------------------------------------------------
8688 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
8691 load atomic acquire - workgroup - global 1. buffer/global_load glc=1
8693 - If CU wavefront execution
8696 2. s_waitcnt vmcnt(0)
8698 - If CU wavefront execution
8700 - Must happen before
8701 the following buffer_gl0_inv
8702 and before any following
8710 - If CU wavefront execution
8717 load atomic acquire - workgroup - local 1. ds_load
8718 2. s_waitcnt lgkmcnt(0)
8721 - Must happen before
8722 the following buffer_gl0_inv
8723 and before any following
8724 global/generic load/load
8730 older than the local load
8736 - If CU wavefront execution
8744 load atomic acquire - workgroup - generic 1. flat_load glc=1
8746 - If CU wavefront execution
8749 2. s_waitcnt lgkmcnt(0) &
8752 - If CU wavefront execution
8753 mode, omit vmcnt(0).
8756 - Must happen before
8758 buffer_gl0_inv and any
8759 following global/generic
8766 older than a local load
8772 - If CU wavefront execution
8779 load atomic acquire - agent - global 1. buffer/global_load
8780 - system glc=1 dlc=1
8781 2. s_waitcnt vmcnt(0)
8783 - Must happen before
8794 - Must happen before
8804 load atomic acquire - agent - generic 1. flat_load glc=1 dlc=1
8805 - system 2. s_waitcnt vmcnt(0) &
8810 - Must happen before
8813 - Ensures the flat_load
8821 - Must happen before
8831 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
8834 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
8835 2. s_waitcnt vm/vscnt(0)
8837 - If CU wavefront execution
8839 - Use vmcnt(0) if atomic with
8840 return and vscnt(0) if
8841 atomic with no-return.
8842 - Must happen before
8843 the following buffer_gl0_inv
8844 and before any following
8852 - If CU wavefront execution
8859 atomicrmw acquire - workgroup - local 1. ds_atomic
8860 2. s_waitcnt lgkmcnt(0)
8863 - Must happen before
8869 older than the local
8881 atomicrmw acquire - workgroup - generic 1. flat_atomic
8882 2. s_waitcnt lgkmcnt(0) &
8885 - If CU wavefront execution
8886 mode, omit vm/vscnt(0).
8887 - If OpenCL, omit lgkmcnt(0).
8888 - Use vmcnt(0) if atomic with
8889 return and vscnt(0) if
8890 atomic with no-return.
8891 - Must happen before
8903 - If CU wavefront execution
8910 atomicrmw acquire - agent - global 1. buffer/global_atomic
8911 - system 2. s_waitcnt vm/vscnt(0)
8913 - Use vmcnt(0) if atomic with
8914 return and vscnt(0) if
8915 atomic with no-return.
8916 - Must happen before
8928 - Must happen before
8938 atomicrmw acquire - agent - generic 1. flat_atomic
8939 - system 2. s_waitcnt vm/vscnt(0) &
8944 - Use vmcnt(0) if atomic with
8945 return and vscnt(0) if
8946 atomic with no-return.
8947 - Must happen before
8959 - Must happen before
8969 fence acquire - singlethread *none* *none*
8971 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
8974 - If CU wavefront execution
8975 mode, omit vmcnt(0) and
8984 vmcnt(0) and vscnt(0).
8985 - However, since LLVM
9000 - Could be split into
9003 vscnt(0) and s_waitcnt
9009 - s_waitcnt vmcnt(0)
9014 atomicrmw-with-return-value
9021 fence-paired-atomic).
9022 - s_waitcnt vscnt(0)
9026 atomicrmw-no-return-value
9033 fence-paired-atomic).
9034 - s_waitcnt lgkmcnt(0)
9045 fence-paired-atomic).
9046 - Must happen before
9060 fence-paired-atomic.
9064 - If CU wavefront execution
9071 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
9072 - system vmcnt(0) & vscnt(0)
9081 vmcnt(0) and vscnt(0).
9082 - However, since LLVM
9090 - Could be split into
9093 vscnt(0) and s_waitcnt
9099 - s_waitcnt vmcnt(0)
9104 atomicrmw-with-return-value
9111 fence-paired-atomic).
9112 - s_waitcnt vscnt(0)
9116 atomicrmw-no-return-value
9123 fence-paired-atomic).
9124 - s_waitcnt lgkmcnt(0)
9135 fence-paired-atomic).
9136 - Must happen before
9150 fence-paired-atomic.
9155 - Must happen before any
9156 following global/generic
9166 ------------------------------------------------------------------------------------
9167 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
9170 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
9171 - generic vmcnt(0) & vscnt(0)
9173 - If CU wavefront execution
9174 mode, omit vmcnt(0) and
9178 - Could be split into
9181 vscnt(0) and s_waitcnt
9187 - s_waitcnt vmcnt(0)
9190 global/generic load/load
9192 atomicrmw-with-return-value.
9193 - s_waitcnt vscnt(0)
9199 atomicrmw-no-return-value.
9200 - s_waitcnt lgkmcnt(0)
9207 - Must happen before
9218 2. buffer/global/flat_store
9219 store atomic release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
9221 - If CU wavefront execution
9224 - Could be split into
9226 vmcnt(0) and s_waitcnt
9232 - s_waitcnt vmcnt(0)
9235 global/generic load/load
9237 atomicrmw-with-return-value.
9238 - s_waitcnt vscnt(0)
9243 atomicrmw-no-return-value.
9244 - Must happen before
9256 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
9257 - system - generic vmcnt(0) & vscnt(0)
9263 - Could be split into
9265 vmcnt(0), s_waitcnt vscnt(0)
9272 - s_waitcnt vmcnt(0)
9278 atomicrmw-with-return-value.
9279 - s_waitcnt vscnt(0)
9284 atomicrmw-no-return-value.
9285 - s_waitcnt lgkmcnt(0)
9292 - Must happen before
9303 2. buffer/global/flat_store
9304 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
9307 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) &
9308 - generic vmcnt(0) & vscnt(0)
9310 - If CU wavefront execution
9311 mode, omit vmcnt(0) and
9313 - If OpenCL, omit lgkmcnt(0).
9314 - Could be split into
9317 vscnt(0) and s_waitcnt
9323 - s_waitcnt vmcnt(0)
9326 global/generic load/load
9328 atomicrmw-with-return-value.
9329 - s_waitcnt vscnt(0)
9335 atomicrmw-no-return-value.
9336 - s_waitcnt lgkmcnt(0)
9343 - Must happen before
9354 2. buffer/global/flat_atomic
9355 atomicrmw release - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
9357 - If CU wavefront execution
9360 - Could be split into
9362 vmcnt(0) and s_waitcnt
9368 - s_waitcnt vmcnt(0)
9371 global/generic load/load
9373 atomicrmw-with-return-value.
9374 - s_waitcnt vscnt(0)
9379 atomicrmw-no-return-value.
9380 - Must happen before
9392 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
9393 - system - generic vmcnt(0) & vscnt(0)
9397 - Could be split into
9400 vscnt(0) and s_waitcnt
9406 - s_waitcnt vmcnt(0)
9411 atomicrmw-with-return-value.
9412 - s_waitcnt vscnt(0)
9417 atomicrmw-no-return-value.
9418 - s_waitcnt lgkmcnt(0)
9425 - Must happen before
9436 2. buffer/global/flat_atomic
9437 fence release - singlethread *none* *none*
9439 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
9442 - If CU wavefront execution
9443 mode, omit vmcnt(0) and
9452 vmcnt(0) and vscnt(0).
9453 - However, since LLVM
9468 - Could be split into
9471 vscnt(0) and s_waitcnt
9477 - s_waitcnt vmcnt(0)
9483 atomicrmw-with-return-value.
9484 - s_waitcnt vscnt(0)
9489 atomicrmw-no-return-value.
9490 - s_waitcnt lgkmcnt(0)
9495 atomic/store atomic/
9497 - Must happen before
9506 fence-paired-atomic).
9513 fence-paired-atomic.
9515 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
9516 - system vmcnt(0) & vscnt(0)
9525 vmcnt(0) and vscnt(0).
9526 - However, since LLVM
9541 - Could be split into
9544 vscnt(0) and s_waitcnt
9550 - s_waitcnt vmcnt(0)
9555 atomicrmw-with-return-value.
9556 - s_waitcnt vscnt(0)
9561 atomicrmw-no-return-value.
9562 - s_waitcnt lgkmcnt(0)
9569 - Must happen before
9578 fence-paired-atomic).
9585 fence-paired-atomic.
9587 **Acquire-Release Atomic**
9588 ------------------------------------------------------------------------------------
9589 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
9592 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) &
9595 - If CU wavefront execution
9596 mode, omit vmcnt(0) and
9606 - Could be split into
9609 vscnt(0), and s_waitcnt
9615 - s_waitcnt vmcnt(0)
9618 global/generic load/load
9620 atomicrmw-with-return-value.
9621 - s_waitcnt vscnt(0)
9627 atomicrmw-no-return-value.
9628 - s_waitcnt lgkmcnt(0)
9635 - Must happen before
9646 2. buffer/global_atomic
9647 3. s_waitcnt vm/vscnt(0)
9649 - If CU wavefront execution
9651 - Use vmcnt(0) if atomic with
9652 return and vscnt(0) if
9653 atomic with no-return.
9654 - Must happen before
9666 - If CU wavefront execution
9673 atomicrmw acq_rel - workgroup - local 1. s_waitcnt vmcnt(0) & vscnt(0)
9675 - If CU wavefront execution
9678 - Could be split into
9680 vmcnt(0) and s_waitcnt
9686 - s_waitcnt vmcnt(0)
9689 global/generic load/load
9691 atomicrmw-with-return-value.
9692 - s_waitcnt vscnt(0)
9697 atomicrmw-no-return-value.
9698 - Must happen before
9710 3. s_waitcnt lgkmcnt(0)
9713 - Must happen before
9719 older than the local load
9725 - If CU wavefront execution
9733 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) &
9736 - If CU wavefront execution
9737 mode, omit vmcnt(0) and
9739 - If OpenCL, omit lgkmcnt(0).
9740 - Could be split into
9743 vscnt(0) and s_waitcnt
9749 - s_waitcnt vmcnt(0)
9752 global/generic load/load
9754 atomicrmw-with-return-value.
9755 - s_waitcnt vscnt(0)
9761 atomicrmw-no-return-value.
9762 - s_waitcnt lgkmcnt(0)
9769 - Must happen before
9781 3. s_waitcnt lgkmcnt(0) &
9784 - If CU wavefront execution
9785 mode, omit vmcnt(0) and
9787 - If OpenCL, omit lgkmcnt(0).
9788 - Must happen before
9800 - If CU wavefront execution
9807 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
9808 - system vmcnt(0) & vscnt(0)
9812 - Could be split into
9815 vscnt(0) and s_waitcnt
9821 - s_waitcnt vmcnt(0)
9826 atomicrmw-with-return-value.
9827 - s_waitcnt vscnt(0)
9832 atomicrmw-no-return-value.
9833 - s_waitcnt lgkmcnt(0)
9840 - Must happen before
9851 2. buffer/global_atomic
9852 3. s_waitcnt vm/vscnt(0)
9854 - Use vmcnt(0) if atomic with
9855 return and vscnt(0) if
9856 atomic with no-return.
9857 - Must happen before
9869 - Must happen before
9879 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
9880 - system vmcnt(0) & vscnt(0)
9884 - Could be split into
9887 vscnt(0), and s_waitcnt
9893 - s_waitcnt vmcnt(0)
9898 atomicrmw-with-return-value.
9899 - s_waitcnt vscnt(0)
9904 atomicrmw-no-return-value.
9905 - s_waitcnt lgkmcnt(0)
9912 - Must happen before
9924 3. s_waitcnt vm/vscnt(0) &
9929 - Use vmcnt(0) if atomic with
9930 return and vscnt(0) if
9931 atomic with no-return.
9932 - Must happen before
9944 - Must happen before
9954 fence acq_rel - singlethread *none* *none*
9956 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) &
9959 - If CU wavefront execution
9960 mode, omit vmcnt(0) and
9969 vmcnt(0) and vscnt(0).
9979 - Could be split into
9982 vscnt(0) and s_waitcnt
9988 - s_waitcnt vmcnt(0)
9994 atomicrmw-with-return-value.
9995 - s_waitcnt vscnt(0)
10000 atomicrmw-no-return-value.
10001 - s_waitcnt lgkmcnt(0)
10006 atomic/store atomic/
10008 - Must happen before
10027 and memory ordering
10031 acquire-fence-paired-atomic)
10044 local/generic store
10048 and memory ordering
10052 release-fence-paired-atomic).
10056 - Must happen before
10060 acquire-fence-paired
10061 atomic has completed
10062 before invalidating
10066 locations read must
10070 acquire-fence-paired-atomic.
10074 - If CU wavefront execution
10081 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
10082 - system vmcnt(0) & vscnt(0)
10091 vmcnt(0) and vscnt(0).
10092 - However, since LLVM
10100 - Could be split into
10102 vmcnt(0), s_waitcnt
10103 vscnt(0) and s_waitcnt
10104 lgkmcnt(0) to allow
10106 independently moved
10109 - s_waitcnt vmcnt(0)
10115 atomicrmw-with-return-value.
10116 - s_waitcnt vscnt(0)
10120 store/store atomic/
10121 atomicrmw-no-return-value.
10122 - s_waitcnt lgkmcnt(0)
10129 - Must happen before
10134 global/local/generic
10139 and memory ordering
10143 acquire-fence-paired-atomic)
10145 before invalidating
10155 global/local/generic
10160 and memory ordering
10164 release-fence-paired-atomic).
10172 - Must happen before
10186 **Sequential Consistent Atomic**
10187 ------------------------------------------------------------------------------------
10188 load atomic seq_cst - singlethread - global *Same as corresponding
10189 - wavefront - local load atomic acquire,
10190 - generic except must generate
10191 all instructions even
10193 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) &
10194 - generic vmcnt(0) & vscnt(0)
10196 - If CU wavefront execution
10197 mode, omit vmcnt(0) and
10199 - Could be split into
10201 vmcnt(0), s_waitcnt
10202 vscnt(0), and s_waitcnt
10203 lgkmcnt(0) to allow
10205 independently moved
10208 - s_waitcnt lgkmcnt(0) must
10215 ordering of seq_cst
10221 lgkmcnt(0) and so do
10224 - s_waitcnt vmcnt(0)
10227 global/generic load
10229 atomicrmw-with-return-value
10231 ordering of seq_cst
10240 - s_waitcnt vscnt(0)
10243 global/generic store
10245 atomicrmw-no-return-value
10247 ordering of seq_cst
10259 consistent global/local
10260 memory instructions
10266 prevents reordering
10269 seq_cst load. (Note
10275 followed by a store
10282 release followed by
10285 order. The s_waitcnt
10286 could be placed after
10287 seq_store or before
10290 make the s_waitcnt be
10291 as late as possible
10297 instructions same as
10300 except must generate
10301 all instructions even
10303 load atomic seq_cst - workgroup - local
10305 1. s_waitcnt vmcnt(0) & vscnt(0)
10307 - If CU wavefront execution
10309 - Could be split into
10311 vmcnt(0) and s_waitcnt
10314 independently moved
10317 - s_waitcnt vmcnt(0)
10320 global/generic load
10322 atomicrmw-with-return-value
10324 ordering of seq_cst
10333 - s_waitcnt vscnt(0)
10336 global/generic store
10338 atomicrmw-no-return-value
10340 ordering of seq_cst
10353 memory instructions
10359 prevents reordering
10362 seq_cst load. (Note
10368 followed by a store
10375 release followed by
10378 order. The s_waitcnt
10379 could be placed after
10380 seq_store or before
10383 make the s_waitcnt be
10384 as late as possible
10390 instructions same as
10393 except must generate
10394 all instructions even
10397 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
10398 - system - generic vmcnt(0) & vscnt(0)
10400 - Could be split into
10402 vmcnt(0), s_waitcnt
10403 vscnt(0) and s_waitcnt
10404 lgkmcnt(0) to allow
10406 independently moved
10409 - s_waitcnt lgkmcnt(0)
10416 ordering of seq_cst
10422 lgkmcnt(0) and so do
10425 - s_waitcnt vmcnt(0)
10428 global/generic load
10430 atomicrmw-with-return-value
10432 ordering of seq_cst
10441 - s_waitcnt vscnt(0)
10444 global/generic store
10446 atomicrmw-no-return-value
10448 ordering of seq_cst
10461 memory instructions
10467 prevents reordering
10470 seq_cst load. (Note
10476 followed by a store
10483 release followed by
10486 order. The s_waitcnt
10487 could be placed after
10488 seq_store or before
10491 make the s_waitcnt be
10492 as late as possible
10498 instructions same as
10501 except must generate
10502 all instructions even
10504 store atomic seq_cst - singlethread - global *Same as corresponding
10505 - wavefront - local store atomic release,
10506 - workgroup - generic except must generate
10507 - agent all instructions even
10508 - system for OpenCL.*
10509 atomicrmw seq_cst - singlethread - global *Same as corresponding
10510 - wavefront - local atomicrmw acq_rel,
10511 - workgroup - generic except must generate
10512 - agent all instructions even
10513 - system for OpenCL.*
10514 fence seq_cst - singlethread *none* *Same as corresponding
10515 - wavefront fence acq_rel,
10516 - workgroup except must generate
10517 - agent all instructions even
10518 - system for OpenCL.*
10519 ============ ============ ============== ========== ================================
10524 For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
10525 runtimes (see :ref:`amdgpu-os`), the runtime installs a trap handler that
10526 supports the ``s_trap`` instruction. For usage see:
10528 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v2-table`
10529 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v3-table`
10530 - :ref:`amdgpu-trap-handler-for-amdhsa-os-v4-table`
10532 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V2
10533 :name: amdgpu-trap-handler-for-amdhsa-os-v2-table
10535 =================== =============== =============== =======================================
10536 Usage Code Sequence Trap Handler Description
10538 =================== =============== =============== =======================================
10539 reserved ``s_trap 0x00`` Reserved by hardware.
10540 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for Finalizer HSA ``debugtrap``
10541 ``queue_ptr`` intrinsic (not implemented).
10544 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes wave to be halted with the PC at
10545 ``queue_ptr`` the trap instruction. The associated
10546 queue is signalled to put it into the
10547 error state. When the queue is put in
10548 the error state, the waves executing
10549 dispatches on the queue will be
10551 ``llvm.debugtrap`` ``s_trap 0x03`` *none* - If the debugger is not enabled, behaves
10552 as a no-operation. The trap handler
10553 is entered and immediately returns to
10554 continue execution of the wavefront.
10555 - If the debugger is enabled, causes
10556 the debug trap to be reported by the
10557 debugger and the wavefront is put in
10558 the halt state with the PC at the
10559 instruction. The debugger must
10560 increment the PC and resume the wave.
10561 reserved ``s_trap 0x04`` Reserved.
10562 reserved ``s_trap 0x05`` Reserved.
10563 reserved ``s_trap 0x06`` Reserved.
10564 reserved ``s_trap 0x07`` Reserved.
10565 reserved ``s_trap 0x08`` Reserved.
10566 reserved ``s_trap 0xfe`` Reserved.
10567 reserved ``s_trap 0xff`` Reserved.
10568 =================== =============== =============== =======================================
10572 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V3
10573 :name: amdgpu-trap-handler-for-amdhsa-os-v3-table
10575 =================== =============== =============== =======================================
10576 Usage Code Sequence Trap Handler Description
10578 =================== =============== =============== =======================================
10579 reserved ``s_trap 0x00`` Reserved by hardware.
10580 debugger breakpoint ``s_trap 0x01`` *none* Reserved for debugger to use for
10581 breakpoints. Causes wave to be halted
10582 with the PC at the trap instruction.
10583 The debugger is responsible to resume
10584 the wave, including the instruction
10585 that the breakpoint overwrote.
10586 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes wave to be halted with the PC at
10587 ``queue_ptr`` the trap instruction. The associated
10588 queue is signalled to put it into the
10589 error state. When the queue is put in
10590 the error state, the waves executing
10591 dispatches on the queue will be
10593 ``llvm.debugtrap`` ``s_trap 0x03`` *none* - If the debugger is not enabled, behaves
10594 as a no-operation. The trap handler
10595 is entered and immediately returns to
10596 continue execution of the wavefront.
10597 - If the debugger is enabled, causes
10598 the debug trap to be reported by the
10599 debugger and the wavefront is put in
10600 the halt state with the PC at the
10601 instruction. The debugger must
10602 increment the PC and resume the wave.
10603 reserved ``s_trap 0x04`` Reserved.
10604 reserved ``s_trap 0x05`` Reserved.
10605 reserved ``s_trap 0x06`` Reserved.
10606 reserved ``s_trap 0x07`` Reserved.
10607 reserved ``s_trap 0x08`` Reserved.
10608 reserved ``s_trap 0xfe`` Reserved.
10609 reserved ``s_trap 0xff`` Reserved.
10610 =================== =============== =============== =======================================
10614 .. table:: AMDGPU Trap Handler for AMDHSA OS Code Object V4
10615 :name: amdgpu-trap-handler-for-amdhsa-os-v4-table
10617 =================== =============== ================ ================= =======================================
10618 Usage Code Sequence GFX6-GFX8 Inputs GFX9-GFX10 Inputs Description
10619 =================== =============== ================ ================= =======================================
10620 reserved ``s_trap 0x00`` Reserved by hardware.
10621 debugger breakpoint ``s_trap 0x01`` *none* *none* Reserved for debugger to use for
10622 breakpoints. Causes wave to be halted
10623 with the PC at the trap instruction.
10624 The debugger is responsible to resume
10625 the wave, including the instruction
10626 that the breakpoint overwrote.
10627 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: *none* Causes wave to be halted with the PC at
10628 ``queue_ptr`` the trap instruction. The associated
10629 queue is signalled to put it into the
10630 error state. When the queue is put in
10631 the error state, the waves executing
10632 dispatches on the queue will be
10634 ``llvm.debugtrap`` ``s_trap 0x03`` *none* *none* - If the debugger is not enabled, behaves
10635 as a no-operation. The trap handler
10636 is entered and immediately returns to
10637 continue execution of the wavefront.
10638 - If the debugger is enabled, causes
10639 the debug trap to be reported by the
10640 debugger and the wavefront is put in
10641 the halt state with the PC at the
10642 instruction. The debugger must
10643 increment the PC and resume the wave.
10644 reserved ``s_trap 0x04`` Reserved.
10645 reserved ``s_trap 0x05`` Reserved.
10646 reserved ``s_trap 0x06`` Reserved.
10647 reserved ``s_trap 0x07`` Reserved.
10648 reserved ``s_trap 0x08`` Reserved.
10649 reserved ``s_trap 0xfe`` Reserved.
10650 reserved ``s_trap 0xff`` Reserved.
10651 =================== =============== ================ ================= =======================================
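The usages in the Code Object V4 table above can be summarized as a small lookup. The following is a minimal Python sketch for illustration only: the names ``TRAP_USAGE``, ``LISTED_RESERVED`` and ``describe_trap`` are hypothetical and are not part of LLVM or any runtime. IDs that the table does not list are reported as unlisted rather than guessed, and the 0x00-0xff range check is an assumption based on the largest ID shown in the table.

```python
# Hypothetical summary of the s_trap IDs listed in the V4 table above.
# None of these names exist in LLVM; they are illustrative only.
TRAP_USAGE = {
    0x00: "reserved by hardware",
    0x01: "debugger breakpoint",
    0x02: "llvm.trap",
    0x03: "llvm.debugtrap",
}

# IDs the table explicitly marks as reserved.
LISTED_RESERVED = {0x04, 0x05, 0x06, 0x07, 0x08, 0xFE, 0xFF}


def describe_trap(trap_id):
    """Return the usage the table gives for an s_trap ID."""
    # Assumption: trap IDs fit in 8 bits, matching the largest ID in the table.
    if not 0 <= trap_id <= 0xFF:
        raise ValueError("trap ID out of range: %r" % (trap_id,))
    if trap_id in TRAP_USAGE:
        return TRAP_USAGE[trap_id]
    if trap_id in LISTED_RESERVED:
        return "reserved"
    return "not listed in the table"
```

For example, ``describe_trap(0x02)`` returns ``"llvm.trap"``, the ID used to halt the wave and put the queue into the error state.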
10653 .. _amdgpu-amdhsa-function-call-convention:
10660 This section is currently incomplete and has inaccuracies. It is a work in
10661 progress that will be updated as information is determined.
10663 See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled
10664 addresses. Unswizzled addresses are normal linear addresses.
10666 .. _amdgpu-amdhsa-function-call-convention-kernel-functions:
10671 This section describes the call convention ABI for the outer kernel function.
10673 See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
10676 The following is not part of the AMDGPU kernel calling convention but describes
10677 how the AMDGPU backend implements function calls:
10679 1. Clang decides the kernarg layout to match the *HSA Programmer's Language
10682 - All structs are passed directly.
10683 - Lambda values are passed *TBA*.
10687 - Does this really follow HSA rules? Or are structs >16 bytes passed
10689 - What is ABI for lambda values?
10691 4. The kernel performs certain setup in its prolog, as described in
10692 :ref:`amdgpu-amdhsa-kernel-prolog`.
10694 .. _amdgpu-amdhsa-function-call-convention-non-kernel-functions:
10696 Non-Kernel Functions
10697 ++++++++++++++++++++
10699 This section describes the call convention ABI for functions other than the
10700 outer kernel function.
10702 If a kernel has function calls then scratch is always allocated and used for
10703 the call stack which grows from low address to high address using the swizzled
10704 scratch address space.
10706 On entry to a function:
10708 1. SGPR0-3 contain a V# with the following properties (see
10709 :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`):
10711 * Base address pointing to the beginning of the wavefront scratch backing
10713 * Swizzled with dword element size and stride of wavefront size elements.
10715 2. The FLAT_SCRATCH register pair is set up. See
10716 :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`.
10717 3. GFX6-GFX8: M0 register set to the size of LDS in bytes. See
10718 :ref:`amdgpu-amdhsa-kernel-prolog-m0`.
10719 4. The EXEC register is set to the lanes active on entry to the function.
10720 5. MODE register: *TBD*
10721 6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
10723 7. SGPR30-31 hold the return address (RA): the code address that the function must
10724 return to when it completes. The value is undefined if the function is *no
10726 8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch
10727 offset relative to the beginning of the wavefront scratch backing memory.
10729 The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
10730 offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
10733 The unswizzled SP value can be converted into the swizzled SP value by:
10735 | swizzled SP = unswizzled SP / wavefront size
10737 This may be used to obtain the private address space address of stack
10738 objects and to convert this address to a flat address by adding the flat
10739 scratch aperture base address.
10741 The swizzled SP value is always 4 byte aligned for the ``r600``
10742 architecture and 16 byte aligned for the ``amdgcn`` architecture.
10746 The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
10747 OpenCL language which has the largest base type defined as 16 bytes.
10749 On entry, the swizzled SP value is the address of the first function
10750 argument passed on the stack. Other stack passed arguments are positive
10751 offsets from the entry swizzled SP value.
10753 The function may use positive offsets beyond the last stack passed argument
10754 for stack allocated local variables and register spill slots. If necessary,
10755 the function may align these to an alignment greater than 16 bytes. After these
10756 the function may dynamically allocate space for such things as runtime sized
10757 ``alloca`` local allocations.
10759 If the function calls another function, it will place any stack allocated
10760 arguments after the last local allocation and adjust SGPR32 to the address
10761 after the last local allocation.
10763 9. All other registers are unspecified.
10764 10. Any necessary ``s_waitcnt`` has been performed to ensure memory is available
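The relationship between the unswizzled and swizzled SP described in item 8 can be sketched as follows. This is an illustrative Python model, not backend code: the function names are hypothetical, and the restriction of the wavefront size to 32 or 64 is an assumption added for the example.

```python
# Illustrative model of the SP conversion described above.
# Function names are hypothetical; they do not exist in LLVM.

# Swizzled SP alignment per architecture, as stated in the text.
SP_ALIGNMENT = {"r600": 4, "amdgcn": 16}


def swizzled_sp(unswizzled_sp, wavefront_size):
    """swizzled SP = unswizzled SP / wavefront size."""
    # Assumption for this sketch: wavefronts have 32 or 64 lanes.
    if wavefront_size not in (32, 64):
        raise ValueError("unexpected wavefront size")
    return unswizzled_sp // wavefront_size


def is_sp_aligned(swizzled, arch):
    """Check the swizzled SP against the per-architecture alignment."""
    return swizzled % SP_ALIGNMENT[arch] == 0
```

With a wavefront size of 64, an unswizzled SP of 1024 corresponds to a swizzled SP of 16, which satisfies the 16 byte alignment stated for ``amdgcn``.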
10767 On exit from a function:
10769 1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
10770 described below. Any registers used are considered clobbered registers.
10771 2. The following registers are preserved and have the same value as on entry:
10776 * All SGPR registers except the clobbered registers of SGPR4-31.
10794 Except for the argument registers, the clobbered and preserved VGPRs
10795 are intermixed at regular intervals in order to keep a
10796 similar ratio independent of the number of allocated VGPRs.
10798 * Lanes of all VGPRs that are inactive at the call site.
10800 For the AMDGPU backend, an inter-procedural register allocation (IPRA)
10801 optimization may mark some of the clobbered SGPR and VGPR registers as
10802 preserved if it can be determined that the called function does not change
10805 3. The PC is set to the RA provided on entry.
10806 4. MODE register: *TBD*.
10807 5. All other registers are clobbered.
10808 6. Any necessary ``s_waitcnt`` has been performed to ensure memory accessed by
10809 the function is available to the caller.
10813 - On gfx908 are all ACC registers clobbered?
10815 - How are function results returned? The address of structured types is passed
10816 by reference, but what about other types?
10818 The function input arguments are made up of the formal arguments explicitly
10819 declared by the source language function plus the implicit input arguments used
10820 by the implementation.
10822 The source language input arguments are:
10824 1. Any source language implicit ``this`` or ``self`` argument comes first as a
10826 2. Followed by the function formal arguments in left to right source order.
10828 The source language result arguments are:
10830 1. The function result argument.
10832 The source language input or result struct type arguments that are less than or
10833 equal to 16 bytes are decomposed recursively into their base type fields, and
10834 each field is passed as if a separate argument. For input arguments, if the
10835 called function requires the struct to be in memory, for example because its
10836 address is taken, then the function body is responsible for allocating a stack
10837 location and copying the field arguments into it. Clang terms this *direct
10840 The source language input struct type arguments that are greater than 16 bytes
10841 are passed by reference. The caller is responsible for allocating a stack
10842 location to make a copy of the struct value and passing the address as the input
10843 argument. The called function is responsible for performing the dereference when
10844 accessing the input argument. Clang terms this *by-value struct*.
10846 A source language result struct type argument that is greater than 16 bytes is
10847 returned by reference. The caller is responsible for allocating a stack location
10848 to hold the result value and passes the address as the last input argument
10849 (before the implicit input arguments). In this case there are no result
10850 arguments. The called function is responsible for performing the dereference when
10851 storing the result value. Clang terms this *structured return (sret)*.
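The 16 byte threshold in the three paragraphs above can be expressed as a small decision function. The following Python sketch only restates the rules as written in this section, which itself notes that the ``sret`` definition may need correction. The name ``classify_struct`` and the returned labels are hypothetical.

```python
# Restates the struct passing rules from the paragraphs above.
# The function name and labels are hypothetical, for illustration only.
DECOMPOSE_LIMIT_BYTES = 16


def classify_struct(size_bytes, is_result):
    """Return how a struct argument or result is passed per the text."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    if size_bytes <= DECOMPOSE_LIMIT_BYTES:
        # Decomposed recursively; each field is passed as a separate argument.
        return "decomposed into fields"
    if is_result:
        # Caller allocates storage and passes its address as the last
        # input argument, before the implicit input arguments.
        return "structured return (sret)"
    # Caller copies the struct to the stack and passes the address.
    return "by-value struct"
```

A 16 byte struct is therefore still decomposed, while a 17 byte struct is passed or returned by reference.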
10853 *TODO: correct the ``sret`` definition.*
10857 Is this definition correct? Or is ``sret`` only used if passing in registers, and
10858 pass as non-decomposed struct as stack argument? Or something else? Is the
10859 memory location in the caller stack frame, or a stack memory argument and so
10860 no address is passed as the caller can directly write to the argument stack
10861 location? But then the stack location is still live after return. If it is an
10862 argument stack location, is it the first stack argument or the last one?
10864 Lambda argument types are treated as struct types with an implementation defined
10869 Need to specify the ABI for lambda types for AMDGPU.
10871 For the AMDGPU backend, all source language arguments (including the decomposed
10872 struct type arguments) are passed in VGPRs unless marked ``inreg`` in which case
10873 they are passed in SGPRs.
10875 The AMDGPU backend walks the function call graph from the leaves to determine
10876 which implicit input arguments are used, propagating to each caller of the
10877 function. The used implicit arguments are appended to the function arguments
10878 after the source language arguments in the following order:
10882 Are recursion and external functions supported?
10884 1. Work-Item ID (1 VGPR)
10886 The X, Y and Z work-item IDs are packed into a single VGPR with the following
10887 layout. Only fields actually used by the function are set. The other bits
10890 The values come from the initial kernel execution state. See
10891 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
10893 .. table:: Work-item implicit argument layout
10894 :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table
10896 ======= ======= ==============
10897 Bits Size Field Name
10898 ======= ======= ==============
10899 9:0 10 bits X Work-Item ID
10900 19:10 10 bits Y Work-Item ID
10901 29:20 10 bits Z Work-Item ID
10902 31:30 2 bits Unused
10903 ======= ======= ==============
10905 2. Dispatch Ptr (2 SGPRs)
10907 The value comes from the initial kernel execution state. See
10908 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10910 3. Queue Ptr (2 SGPRs)
10912 The value comes from the initial kernel execution state. See
10913 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10915 4. Kernarg Segment Ptr (2 SGPRs)
10917 The value comes from the initial kernel execution state. See
10918 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10920 5. Dispatch ID (2 SGPRs)
10922 The value comes from the initial kernel execution state. See
10923 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10925 6. Work-Group ID X (1 SGPR)
10927 The value comes from the initial kernel execution state. See
10928 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10930 7. Work-Group ID Y (1 SGPR)
10932 The value comes from the initial kernel execution state. See
10933 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10935 8. Work-Group ID Z (1 SGPR)
10937 The value comes from the initial kernel execution state. See
10938 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
10940 9. Implicit Argument Ptr (2 SGPRs)
10942 The value is computed by adding an offset to Kernarg Segment Ptr to get the
10943 global address space pointer to the first kernarg implicit argument.
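The packed work-item ID layout from item 1 (three 10-bit fields in one 32-bit VGPR, with bits 31:30 unused) can be illustrated with a short Python sketch. The function names are hypothetical; the bit positions come from the work-item implicit argument layout table.

```python
# Bit layout taken from the work-item implicit argument layout table:
# X in bits 9:0, Y in bits 19:10, Z in bits 29:20, bits 31:30 unused.
# Function names are hypothetical, for illustration only.
FIELD_BITS = 10
FIELD_MASK = (1 << FIELD_BITS) - 1


def pack_work_item_id(x, y, z):
    """Pack X, Y and Z work-item IDs into one 32-bit value."""
    for value in (x, y, z):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError("work-item ID does not fit in 10 bits")
    return x | (y << FIELD_BITS) | (z << (2 * FIELD_BITS))


def unpack_work_item_id(vgpr):
    """Extract the (X, Y, Z) work-item IDs from a packed 32-bit value."""
    x = vgpr & FIELD_MASK
    y = (vgpr >> FIELD_BITS) & FIELD_MASK
    z = (vgpr >> (2 * FIELD_BITS)) & FIELD_MASK
    return x, y, z
```

Because only the fields actually used by the function are set, code reading the packed value should mask each field, as ``unpack_work_item_id`` does, instead of relying on the other bits.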
10945 The input and result arguments are assigned in order in the following manner:
10949 There are likely some errors and omissions in the following description that
10954 Check the Clang source code to decipher how function arguments and return
10955 results are handled. Also see the AMDGPU specific values used.
10957 * VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
10960 If there are more arguments than will fit in these registers, the remaining
10961 arguments are allocated on the stack in order on naturally aligned
10966 How are overly aligned structures allocated on the stack?
10968 * SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
10971 If there are more arguments than will fit in these registers, the remaining
10972 arguments are allocated on the stack in order on naturally aligned
10975 Note that decomposed struct type arguments may have some fields passed in
10976 registers and some in memory.
10980 So, a struct which can pass some fields as decomposed register arguments, will
10981 pass the rest as decomposed stack elements? But an argument that will not start
10982 in registers will not be decomposed and will be passed as a non-decomposed
10985 The following is not part of the AMDGPU function calling convention but
10986 describes how the AMDGPU backend implements function calls:
10988 1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an
10989 unswizzled scratch address. It is only needed if runtime sized ``alloca``
10990 allocations are used, or for the reasons defined in ``SIFrameLowering``.
10991 2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP)
10992 to access the incoming stack arguments in the function. The BP is needed
10993 only when the function requires runtime stack alignment.
10995 3. Allocating SGPR arguments on the stack is not supported.
10997 4. No CFI is currently generated. See
10998 :ref:`amdgpu-dwarf-call-frame-information`.
11002 CFI will be generated that defines the CFA as the unswizzled address
11003 relative to the wave scratch base in the unswizzled private address space
11004 of the lowest address stack allocated local variable.
11006 ``DW_AT_frame_base`` will be defined as the swizzled address in the
11007 swizzled private address space by dividing the CFA by the wavefront size
11008 (since CFA is always at least dword aligned which matches the scratch
11009 swizzle element size).
11011 If no dynamic stack alignment was performed, the stack allocated arguments
11012 are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
11013 local variables and register spill slots are accessed as positive offsets
11014 relative to ``DW_AT_frame_base``.
11016 5. Function argument passing is implemented by copying the input physical
11017 registers to virtual registers on entry. The register allocator can spill if
11018 necessary. These are copied back to physical registers at call sites. The
11019 net effect is that each function call can have these values in entirely
11020 distinct locations. The IPRA can help avoid shuffling argument registers.
11021 6. Call sites are implemented by setting up the arguments at positive offsets
11022 from SP. Then SP is incremented to account for the known frame size before
11023 the call and decremented after the call.
11027 The CFI will reflect the changed calculation needed to compute the CFA
11030 7. 4 byte spill slots are used in the stack frame. One slot is allocated for an
11031 emergency spill slot. Buffer instructions are used for stack accesses and
11032 not the ``flat_scratch`` instructions.
11036 Explain when the emergency spill slot is used.
11040 Possible broken issues:
11042 - Stack arguments must be aligned to required alignment.
11043 - Stack is aligned to max(16, max formal argument alignment)
11044 - Direct argument < 64 bits should check register budget.
11045 - Register budget calculation should respect ``inreg`` for SGPR.
11046 - SGPR overflow is not handled.
11047 - Unpeeling of a struct with 1 member does not check the size of the member.
11048 - ``sret`` is after ``this`` pointer.
11049 - Caller is not implementing stack realignment: need an extra pointer.
11050 - Should say AMDGPU passes FP rather than SP.
11051 - Should CFI define the CFA as the address of locals or arguments? The difference
11052 is apparent once dynamic alignment is implemented.
11053 - If the ``SCRATCH`` instructions allowed negative offsets, the FP could be the
11054 highest address of the stack frame, with negative offsets used for locals. This
11055 would allow SP to be the same as FP and could support signal-handler-like usage,
11056 since there would be a real SP for the top of the stack.
11057 - How is ``sret`` passed on the stack? In argument stack area? Can it overlay
11063 This section provides code conventions used when the target triple OS is
11064 ``amdpal`` (see :ref:`amdgpu-target-triples`).
11066 .. _amdgpu-amdpal-code-object-metadata-section:
11068 Code Object Metadata
11069 ~~~~~~~~~~~~~~~~~~~~
11073 The metadata is currently in development and is subject to major
11074 changes. Only the current version is supported. *When this document
11075 was generated the version was 2.6.*
11077 Code object metadata is specified by the ``NT_AMDGPU_METADATA`` note
11078 record (see :ref:`amdgpu-note-records-v3-v4`).
11080 The metadata is represented as Message Pack formatted binary data (see
11081 [MsgPack]_). The top level is a Message Pack map that includes the keys
11082 defined in table :ref:`amdgpu-amdpal-code-object-metadata-map-table`
11083 and referenced tables.
11085 Additional information can be added to the maps. To avoid conflicts, any
11086 key names should be prefixed by "*vendor-name*." where ``vendor-name``
11087 can be the name of the vendor and specific vendor tool that generates the
11088 information. The prefix is abbreviated to simply "." when it appears
11089 within a map that has been added by the same *vendor-name*.
.. table:: AMDPAL Code Object Metadata Map
   :name: amdgpu-amdpal-code-object-metadata-map-table

   =================== ============== ========= ======================================================================
   String Key          Value Type     Required? Description
   =================== ============== ========= ======================================================================
   "amdpal.version"    sequence of    Required  PAL code object metadata (major, minor) version. The current values
                       2 integers               are defined by *Util::Abi::PipelineMetadata(Major|Minor)Version*.
   "amdpal.pipelines"  sequence of    Required  Per-pipeline metadata. See
                       map                      :ref:`amdgpu-amdpal-code-object-pipeline-metadata-map-table` for the
                                                definition of the keys included in that map.
   =================== ============== ========= ======================================================================
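
The top-level map described above can be checked with a few lines of Python
once the Message Pack payload has been decoded into ordinary dictionaries and
lists. The following is a minimal sketch only: the function name
``check_amdpal_metadata`` is hypothetical, decoding is assumed to have been done
elsewhere, and only the two required top-level keys from the table are checked.

.. code-block:: python

   # Required keys of the top-level AMDPAL metadata map (from the table above).
   REQUIRED_TOP_LEVEL_KEYS = ("amdpal.version", "amdpal.pipelines")


   def check_amdpal_metadata(metadata):
       """Return a list of problems found in a decoded top-level metadata map.

       Hypothetical helper: ``metadata`` is assumed to be a dict obtained by
       decoding the Message Pack payload with some external tool.
       """
       problems = []
       for key in REQUIRED_TOP_LEVEL_KEYS:
           if key not in metadata:
               problems.append(f"missing required key {key!r}")

       version = metadata.get("amdpal.version")
       if version is not None:
           is_pair = isinstance(version, (list, tuple)) and len(version) == 2
           if not (is_pair and all(isinstance(v, int) for v in version)):
               problems.append("'amdpal.version' must be a sequence of 2 integers")

       pipelines = metadata.get("amdpal.pipelines")
       if pipelines is not None:
           is_seq = isinstance(pipelines, (list, tuple))
           if not (is_seq and all(isinstance(p, dict) for p in pipelines)):
               problems.append("'amdpal.pipelines' must be a sequence of maps")
       return problems


   # Illustrative input only; the version numbers are the ones quoted in this
   # section and the pipeline map is left empty for brevity.
   example = {"amdpal.version": [2, 6], "amdpal.pipelines": [{}]}
   print(check_amdpal_metadata(example))
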
11106 .. table:: AMDPAL Code Object Pipeline Metadata Map
11107 :name: amdgpu-amdpal-code-object-pipeline-metadata-map-table
11109 ====================================== ============== ========= ===================================================
11110 String Key Value Type Required? Description
11111 ====================================== ============== ========= ===================================================
11112 ".name" string Source name of the pipeline.
11113 ".type" string Pipeline type, e.g. VsPs. Values include:
11123 ".internal_pipeline_hash" sequence of Required Internal compiler hash for this pipeline. Lower
11124 2 integers 64 bits is the "stable" portion of the hash, used
11125 for e.g. shader replacement lookup. Upper 64 bits
11126 is the "unique" portion of the hash, used for
11127 e.g. pipeline cache lookup. The value is
11128 implementation defined, and can not be relied on
11129 between different builds of the compiler.
11130 ".shaders" map Per-API shader metadata. See
11131 :ref:`amdgpu-amdpal-code-object-shader-map-table`
11132 for the definition of the keys included in that
11134 ".hardware_stages" map Per-hardware stage metadata. See
11135 :ref:`amdgpu-amdpal-code-object-hardware-stage-map-table`
11136 for the definition of the keys included in that
11138 ".shader_functions" map Per-shader function metadata. See
11139 :ref:`amdgpu-amdpal-code-object-shader-function-map-table`
11140 for the definition of the keys included in that
11142 ".registers" map Required Hardware register configuration. See
11143 :ref:`amdgpu-amdpal-code-object-register-map-table`
11144 for the definition of the keys included in that
11146 ".user_data_limit" integer Number of user data entries accessed by this
11148 ".spill_threshold" integer The user data spill threshold. 0xFFFF for
11149 NoUserDataSpilling.
11150 ".uses_viewport_array_index" boolean Indicates whether or not the pipeline uses the
11151 viewport array index feature. Pipelines which use
11152 this feature can render into all 16 viewports,
11153 whereas pipelines which do not use it are
11154 restricted to viewport #0.
11155 ".es_gs_lds_size" integer Size in bytes of LDS space used internally for
11156 handling data-passing between the ES and GS
11157 shader stages. This can be zero if the data is
11158 passed using off-chip buffers. This value should
11159 be used to program all user-SGPRs which have been
11160 marked with "UserDataMapping::EsGsLdsSize"
11161 (typically only the GS and VS HW stages will ever
11162 have a user-SGPR so marked).
11163 ".nggSubgroupSize" integer Explicit maximum subgroup size for NGG shaders
11164 (maximum number of threads in a subgroup).
11165 ".num_interpolants" integer Graphics only. Number of PS interpolants.
11166 ".mesh_scratch_memory_size" integer Max mesh shader scratch memory used.
11167 ".api" string Name of the client graphics API.
11168 ".api_create_info" binary Graphics API shader create info binary blob. Can
11169 be defined by the driver using the compiler if
11170 they want to be able to correlate API-specific
11171 information used during creation at a later time.
11172 ====================================== ============== ========= ===================================================
11176 .. table:: AMDPAL Code Object Shader Map
11177 :name: amdgpu-amdpal-code-object-shader-map-table
11180 +-------------+--------------+-------------------------------------------------------------------+
11181 |String Key |Value Type |Description |
11182 +=============+==============+===================================================================+
11183 |- ".compute" |map |See :ref:`amdgpu-amdpal-code-object-api-shader-metadata-map-table` |
11184 |- ".vertex" | |for the definition of the keys included in that map. |
11187 |- ".geometry"| | |
11189 +-------------+--------------+-------------------------------------------------------------------+
11193 .. table:: AMDPAL Code Object API Shader Metadata Map
11194 :name: amdgpu-amdpal-code-object-api-shader-metadata-map-table
11196 ==================== ============== ========= =====================================================================
11197 String Key Value Type Required? Description
11198 ==================== ============== ========= =====================================================================
11199 ".api_shader_hash" sequence of Required Input shader hash, typically passed in from the client. The value
11200 2 integers is implementation defined, and can not be relied on between
11201 different builds of the compiler.
11202 ".hardware_mapping" sequence of Required Flags indicating the HW stages this API shader maps to. Values
11213 ==================== ============== ========= =====================================================================
11217 .. table:: AMDPAL Code Object Hardware Stage Map
11218 :name: amdgpu-amdpal-code-object-hardware-stage-map-table
11220 +-------------+--------------+-----------------------------------------------------------------------+
11221 |String Key |Value Type |Description |
11222 +=============+==============+=======================================================================+
11223 |- ".ls" |map |See :ref:`amdgpu-amdpal-code-object-hardware-stage-metadata-map-table` |
11224 |- ".hs" | |for the definition of the keys included in that map. |
11230 +-------------+--------------+-----------------------------------------------------------------------+
11234 .. table:: AMDPAL Code Object Hardware Stage Metadata Map
11235 :name: amdgpu-amdpal-code-object-hardware-stage-metadata-map-table
11237 ========================== ============== ========= ===============================================================
11238 String Key Value Type Required? Description
11239 ========================== ============== ========= ===============================================================
11240 ".entry_point" string The ELF symbol pointing to this pipeline's stage entry point.
11241 ".scratch_memory_size" integer Scratch memory size in bytes.
11242 ".lds_size" integer Local Data Share size in bytes.
11243 ".perf_data_buffer_size" integer Performance data buffer size in bytes.
11244 ".vgpr_count" integer Number of VGPRs used.
11245 ".sgpr_count" integer Number of SGPRs used.
11246 ".vgpr_limit" integer If non-zero, indicates the shader was compiled with a
11247 directive to instruct the compiler to limit the VGPR usage to
11248 be less than or equal to the specified value (only set if
11249 different from HW default).
11250 ".sgpr_limit" integer SGPR count upper limit (only set if different from HW
11252 ".threadgroup_dimensions" sequence of Thread-group X/Y/Z dimensions (Compute only).
11254 ".wavefront_size" integer Wavefront size (only set if different from HW default).
11255 ".uses_uavs" boolean The shader reads or writes UAVs.
11256 ".uses_rovs" boolean The shader reads or writes ROVs.
11257 ".writes_uavs" boolean The shader writes to one or more UAVs.
11258 ".writes_depth" boolean The shader writes out a depth value.
11259 ".uses_append_consume" boolean The shader uses append and/or consume operations, either
11261 ".uses_prim_id" boolean The shader uses PrimID.
11262 ========================== ============== ========= ===============================================================
.. table:: AMDPAL Code Object Shader Function Map
   :name: amdgpu-amdpal-code-object-shader-function-map-table

   =============== ============== ====================================================================
   String Key      Value Type     Description
   =============== ============== ====================================================================
   *symbol name*   map            *symbol name* is the ELF symbol name of the shader function code
                                  entry address. The value is the function's metadata. See
                                  :ref:`amdgpu-amdpal-code-object-shader-function-metadata-map-table`.
   =============== ============== ====================================================================
11279 .. table:: AMDPAL Code Object Shader Function Metadata Map
11280 :name: amdgpu-amdpal-code-object-shader-function-metadata-map-table
11282 ============================= ============== =================================================================
11283 String Key Value Type Description
11284 ============================= ============== =================================================================
11285 ".api_shader_hash" sequence of Input shader hash, typically passed in from the client. The value
11286 2 integers is implementation defined, and can not be relied on between
11287 different builds of the compiler.
11288 ".scratch_memory_size" integer Size in bytes of scratch memory used by the shader.
11289 ".lds_size" integer Size in bytes of LDS memory.
11290 ".vgpr_count" integer Number of VGPRs used by the shader.
11291 ".sgpr_count" integer Number of SGPRs used by the shader.
11292 ".stack_frame_size_in_bytes" integer Amount of stack size used by the shader.
11293 ".shader_subtype" string Shader subtype/kind. Values include:
11297 ============================= ============== =================================================================
11301 .. table:: AMDPAL Code Object Register Map
11302 :name: amdgpu-amdpal-code-object-register-map-table
11304 ========================== ============== ====================================================================
11305 32-bit Integer Key Value Type Description
11306 ========================== ============== ====================================================================
11307 ``reg offset`` 32-bit integer ``reg offset`` is the dword offset into the GFXIP register space of
11308 a GRBM register (i.e., driver accessible GPU register number, not
11309 shader GPR register number). The driver is required to program each
11310 specified register to the corresponding specified value when
11311 executing this pipeline. Typically, the ``reg offsets`` are the
11312 ``uint16_t`` offsets to each register as defined by the hardware
11313 chip headers. The register is set to the provided value. However, a
11314 ``reg offset`` that specifies a user data register (e.g.,
11315 COMPUTE_USER_DATA_0) needs special treatment. See
11316 :ref:`amdgpu-amdpal-code-object-user-data-section` section for more
11318 ========================== ============== ====================================================================
11320 .. _amdgpu-amdpal-code-object-user-data-section:
11325 Each hardware stage has a set of 32-bit physical SPI *user data registers*
11326 (either 16 or 32 based on graphics IP and the stage) which can be
11327 written from a command buffer and then loaded into SGPRs when waves are
11328 launched via a subsequent dispatch or draw operation. This is the way
11329 most arguments are passed from the application/runtime to a hardware
11332 PAL abstracts this functionality by exposing a set of 128 *user data
11333 entries* per pipeline a client can use to pass arguments from a command
11334 buffer to one or more shaders in that pipeline. The ELF code object must
11335 specify a mapping from virtualized *user data entries* to physical *user
11336 data registers*, and PAL is responsible for implementing that mapping,
11337 including spilling overflow *user data entries* to memory if needed.
11339 Since the *user data registers* are GRBM-accessible SPI registers, this
11340 mapping is actually embedded in the ``.registers`` metadata entry. For
11341 most registers, the value in that map is a literal 32-bit value that
11342 should be written to the register by the driver. However, when the
11343 register is a *user data register* (any USER_DATA register e.g.,
11344 SPI_SHADER_USER_DATA_PS_5), the value is instead an encoding that tells
11345 the driver to write either a *user data entry* value or one of several
11346 driver-internal values to the register. This encoding is described in
11347 the following table:
11351 Currently, *user data registers* 0 and 1 (e.g., SPI_SHADER_USER_DATA_PS_0,
11352 and SPI_SHADER_USER_DATA_PS_1) are reserved. *User data register* 0 must
11353 always be programmed to the address of the GlobalTable, and *user data
11354 register* 1 must always be programmed to the address of the PerShaderTable.
11358 .. table:: AMDPAL User Data Mapping
11359 :name: amdgpu-amdpal-code-object-metadata-user-data-mapping-table
11361 ========== ================= ===============================================================================
11362 Value Name Description
11363 ========== ================= ===============================================================================
11364 0..127 *User Data Entry* 32-bit value of user_data_entry[N] as specified via *CmdSetUserData()*
11365 0x10000000 GlobalTable 32-bit pointer to GPU memory containing the global internal table (should
11366 always point to *user data register* 0).
11367 0x10000001 PerShaderTable 32-bit pointer to GPU memory containing the per-shader internal table. See
11368 :ref:`amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section`
11369 for more detail (should always point to *user data register* 1).
11370 0x10000002 SpillTable 32-bit pointer to GPU memory containing the user data spill table. See
11371 :ref:`amdgpu-amdpal-code-object-metadata-user-data-spill-table-section` for
11373 0x10000003 BaseVertex Vertex offset (32-bit unsigned integer). Not needed if the pipeline doesn't
11374 reference the draw index in the vertex shader. Only supported by the first
11375 stage in a graphics pipeline.
11376 0x10000004 BaseInstance Instance offset (32-bit unsigned integer). Only supported by the first stage in
11377 a graphics pipeline.
11378 0x10000005 DrawIndex Draw index (32-bit unsigned integer). Only supported by the first stage in a
11380 0x10000006 Workgroup Thread group count (32-bit unsigned integer). Low half of a 64-bit address of
11381 a buffer containing the grid dimensions for a Compute dispatch operation. The
11382 high half of the address is stored in the next sequential user-SGPR. Only
11383 supported by compute pipelines.
11384 0x1000000A EsGsLdsSize Indicates that PAL will program this user-SGPR to contain the amount of LDS
11385 space used for the ES/GS pseudo-ring-buffer for passing data between shader
11387 0x1000000B ViewId View id (32-bit unsigned integer) identifies a view of graphic
11388 pipeline instancing.
11389 0x1000000C StreamOutTable 32-bit pointer to GPU memory containing the stream out target SRD table. This
11390 can only appear for one shader stage per pipeline.
11391 0x1000000D PerShaderPerfData 32-bit pointer to GPU memory containing the per-shader performance data buffer.
11392 0x1000000F VertexBufferTable 32-bit pointer to GPU memory containing the vertex buffer SRD table. This can
11393 only appear for one shader stage per pipeline.
11394 0x10000010 UavExportTable 32-bit pointer to GPU memory containing the UAV export SRD table. This can
11395 only appear for one shader stage per pipeline (PS). These replace color targets
11396 and are completely separate from any UAVs used by the shader. This is optional,
11397 and only used by the PS when UAV exports are used to replace color-target
11398 exports to optimize specific shaders.
11399 0x10000011 NggCullingData 64-bit pointer to GPU memory containing the hardware register data needed by
11400 some NGG pipelines to perform culling. This value contains the address of the
11401 first of two consecutive registers which provide the full GPU address.
11402 0x10000015 FetchShaderPtr 64-bit pointer to GPU memory containing the fetch shader subroutine.
11403 ========== ================= ===============================================================================
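
To illustrate how a tool might interpret a ``.registers`` value for a *user
data register*, the following sketch maps a value to the name used in the table
above. It is not PAL's implementation: the function name
``describe_user_data_value`` is hypothetical, and only the values listed in the
table as shown here are included.

.. code-block:: python

   # Special encodings taken from the "AMDPAL User Data Mapping" table.
   USER_DATA_SPECIAL = {
       0x10000000: "GlobalTable",
       0x10000001: "PerShaderTable",
       0x10000002: "SpillTable",
       0x10000003: "BaseVertex",
       0x10000004: "BaseInstance",
       0x10000005: "DrawIndex",
       0x10000006: "Workgroup",
       0x1000000A: "EsGsLdsSize",
       0x1000000B: "ViewId",
       0x1000000C: "StreamOutTable",
       0x1000000D: "PerShaderPerfData",
       0x1000000F: "VertexBufferTable",
       0x10000010: "UavExportTable",
       0x10000011: "NggCullingData",
       0x10000015: "FetchShaderPtr",
   }


   def describe_user_data_value(value):
       """Describe the encoded value stored for a user data register."""
       if 0 <= value <= 127:
           # Values 0..127 select one of the 128 virtualized user data entries.
           return f"user_data_entry[{value}]"
       return USER_DATA_SPECIAL.get(value, f"unknown (0x{value:08X})")


   print(describe_user_data_value(5))
   print(describe_user_data_value(0x10000002))
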
11405 .. _amdgpu-amdpal-code-object-metadata-user-data-per-shader-table-section:
11410 Low 32 bits of the GPU address for an optional buffer in the ``.data``
11411 section of the ELF. The high 32 bits of the address match the high 32 bits
11412 of the shader's program counter.
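
The address reconstruction described above can be sketched as follows. The
function name and the sample values are illustrative assumptions; the only
rule taken from the text is that the high 32 bits come from the shader's
program counter and the low 32 bits from the user data register.

.. code-block:: python

   def per_shader_table_address(program_counter, low32):
       """Combine the PC's high 32 bits with the table's low 32 bits."""
       high32 = program_counter & 0xFFFFFFFF00000000
       return high32 | (low32 & 0xFFFFFFFF)


   # Illustrative values only.
   address = per_shader_table_address(0x00007F1200001000, 0x00042000)
   print(hex(address))
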
The buffer can be anything the shader compiler needs it for, and
allows each shader to have its own region of the ``.data`` section.
Typically, this could be a table of buffer SRDs and the data pointed to
by the buffer SRDs, but it could be a flat-address region of memory as
well. Its layout and usage are defined by the shader compiler.
11420 Each shader's table in the ``.data`` section is referenced by the symbol
11421 ``_amdgpu_``\ *xs*\ ``_shdr_intrl_data`` where *xs* corresponds with the
11422 hardware shader stage the data is for. E.g.,
11423 ``_amdgpu_cs_shdr_intrl_data`` for the compute shader hardware stage.
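
As a small illustration of the naming scheme, the symbol name can be derived
from a hardware stage abbreviation. The helper name
``per_shader_table_symbol`` is hypothetical, and accepting a leading ``.`` (as
used for keys in the hardware stage map) is a convenience assumed here.

.. code-block:: python

   def per_shader_table_symbol(stage):
       """Build the per-shader table symbol name for a hardware stage."""
       abbreviation = stage.lstrip(".").lower()
       if not abbreviation.isalnum():
           raise ValueError(f"unexpected hardware stage name: {stage!r}")
       return f"_amdgpu_{abbreviation}_shdr_intrl_data"


   print(per_shader_table_symbol("cs"))
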
11425 .. _amdgpu-amdpal-code-object-metadata-user-data-spill-table-section:
It is possible for a hardware shader to need access to more *user data
entries* than there are slots available in user data registers for one
or more hardware shader stages. In that case, the PAL runtime expects
the necessary *user data entries* to be spilled to GPU memory, with
one user data register used to point to the spilled user data memory. The
value of the *user data entry* must then represent the location where
a shader expects to read the low 32 bits of the table's GPU virtual
address. The *spill table* itself represents a set of 32-bit values
managed by the PAL runtime in GPU-accessible memory that can be made
indirectly accessible to a hardware shader.
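
The idea of spilling can be sketched with a toy assignment policy. This is an
assumption-laden illustration, not the policy used by any real compiler or by
PAL: it ignores the reserved registers, assumes at least one register is
available, and simply reserves the last available register for the
``SpillTable`` pointer (``0x10000002``) when the entries do not fit.

.. code-block:: python

   SPILL_TABLE = 0x10000002  # SpillTable encoding from the user data mapping table


   def assign_user_data(entries, num_registers):
       """Return (register_values, spilled_entries) for one hardware stage.

       ``entries`` is a list of user data entry indices the stage needs and
       ``num_registers`` is the number of registers available to hold them.
       """
       if num_registers < 1:
           raise ValueError("at least one user data register is required")
       if len(entries) <= num_registers:
           return list(entries), []
       # Not everything fits: keep one register for the spill table pointer.
       direct = list(entries[: num_registers - 1])
       spilled = list(entries[num_registers - 1:])
       return direct + [SPILL_TABLE], spilled


   print(assign_user_data([0, 1, 2, 3, 4], 3))
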
11444 This section provides code conventions used when the target triple OS is
11445 empty (see :ref:`amdgpu-target-triples`).
For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` instructions are handled as follows:
.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
   :name: amdgpu-trap-handler-for-non-amdhsa-os-table

   =============== =============== ===========================================
   Usage           Code Sequence   Description
   =============== =============== ===========================================
   llvm.trap       s_endpgm        Causes wavefront to be terminated.
   llvm.debugtrap  *none*          Compiler warning given that there is no
                                   trap handler installed.
   =============== =============== ===========================================
11473 When the language is OpenCL the following differences occur:
11475 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
11476 2. The AMDGPU backend appends additional arguments to the kernel's explicit
11477 arguments for the AMDHSA OS (see
11478 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
11479 3. Additional metadata is generated
11480 (see :ref:`amdgpu-amdhsa-code-object-metadata`).
11482 .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
11483 :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
11485 ======== ==== ========= ===========================================
11486 Position Byte Byte Description
11488 ======== ==== ========= ===========================================
11489 1 8 8 OpenCL Global Offset X
11490 2 8 8 OpenCL Global Offset Y
11491 3 8 8 OpenCL Global Offset Z
11492 4 8 8 OpenCL address of printf buffer
11493 5 8 8 OpenCL address of virtual queue used by
11495 6 8 8 OpenCL address of AqlWrap struct used by
7 8 8 Pointer argument used for Multi-grid
11499 ======== ==== ========= ===========================================
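
Because the implicit arguments are appended after the explicit ones, their
offsets follow from the size of the explicit argument block. The sketch below
reads the two numeric columns of the table as byte size and byte alignment
(both 8 for every row). The short argument labels and the function names are
hypothetical and chosen only for this example.

.. code-block:: python

   # (label, size in bytes, alignment in bytes), in table order.
   IMPLICIT_ARGS = [
       ("global_offset_x", 8, 8),
       ("global_offset_y", 8, 8),
       ("global_offset_z", 8, 8),
       ("printf_buffer", 8, 8),
       ("virtual_queue", 8, 8),
       ("aqlwrap", 8, 8),
       ("multigrid_sync_arg", 8, 8),
   ]


   def align_up(value, alignment):
       """Round ``value`` up to the next multiple of ``alignment``."""
       return (value + alignment - 1) // alignment * alignment


   def implicit_arg_offsets(explicit_args_size):
       """Return a dict of byte offsets for the appended implicit arguments."""
       offsets = {}
       offset = explicit_args_size
       for label, size, alignment in IMPLICIT_ARGS:
           offset = align_up(offset, alignment)
           offsets[label] = offset
           offset += size
       return offsets


   # A kernel whose explicit arguments occupy 20 bytes (illustrative value).
   print(implicit_arg_offsets(20))
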
11506 When the language is HCC the following differences occur:
11508 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
11510 .. _amdgpu-assembler:
The AMDGPU backend has an LLVM-MC based assembler, which is currently in
development. It supports AMDGCN GFX6-GFX10.
11518 This section describes general syntax for instructions and operands.
11523 An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
11525 | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,...
11526 <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
11528 :doc:`Operands<AMDGPUOperandSyntax>` are comma-separated while
11529 :doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
11531 The order of operands and modifiers is fixed.
11532 Most modifiers are optional and may be omitted.
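
The comma and space rules above are enough to split many simple instructions
into their parts. The following sketch is not the assembler's parser and the
function name ``split_instruction`` is hypothetical. It assumes every operand
except the last is written with a trailing comma, that there is at least one
operand whenever anything follows the opcode, and that ``;`` starts a comment.
It does not handle operands or modifiers that contain spaces.

.. code-block:: python

   def split_instruction(line):
       """Split one line into (opcode, operands, modifiers)."""
       code = line.split(";", 1)[0].strip()  # drop a trailing comment
       if not code:
           return None
       tokens = code.split()
       opcode, rest = tokens[0], tokens[1:]
       operands, modifiers = [], []
       expecting_operand = True
       for token in rest:
           if expecting_operand:
               operands.append(token.rstrip(","))
               # A trailing comma means another operand follows.
               expecting_operand = token.endswith(",")
           else:
               modifiers.append(token)
       return opcode, operands, modifiers


   print(split_instruction("ds_add_u32 v2, v4 offset:16"))
   print(split_instruction("flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc"))
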
Links to detailed descriptions of instruction syntax can be found in the
following table. Note that features under development are not included
in this description.

=================================== =======================================
Core ISA                            ISA Extensions
=================================== =======================================
:doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`   \-
:doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`   \-
:doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`   :doc:`gfx900<AMDGPU/AMDGPUAsmGFX900>`

                                    :doc:`gfx902<AMDGPU/AMDGPUAsmGFX900>`

                                    :doc:`gfx904<AMDGPU/AMDGPUAsmGFX904>`

                                    :doc:`gfx906<AMDGPU/AMDGPUAsmGFX906>`

                                    :doc:`gfx908<AMDGPU/AMDGPUAsmGFX908>`

                                    :doc:`gfx909<AMDGPU/AMDGPUAsmGFX900>`

                                    :doc:`gfx90a<AMDGPU/AMDGPUAsmGFX90a>`

:doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>` :doc:`gfx1011<AMDGPU/AMDGPUAsmGFX1011>`

                                    :doc:`gfx1012<AMDGPU/AMDGPUAsmGFX1011>`
=================================== =======================================

For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_,
[AMD-GCN-GFX900-GFX904-VEGA]_, [AMD-GCN-GFX906-VEGA7NM]_,
[AMD-GCN-GFX908-CDNA1]_, [AMD-GCN-GFX10-RDNA1]_ and [AMD-GCN-GFX10-RDNA2]_.
A detailed description of operands can be found :doc:`here<AMDGPUOperandSyntax>`.
A detailed description of modifiers can be found
:doc:`here<AMDGPUModifierSyntax>`.
11579 Instruction Examples
11580 ~~~~~~~~~~~~~~~~~~~~
11585 .. code-block:: nasm
11587 ds_add_u32 v2, v4 offset:16
11588 ds_write_src2_b64 v2 offset0:4 offset1:8
11589 ds_cmpst_f32 v2, v4, v6
11590 ds_min_rtn_f64 v[8:9], v2, v[4:5]
For a full list of supported instructions, refer to "LDS/GDS instructions" in ISA
11598 .. code-block:: nasm
11600 flat_load_dword v1, v[3:4]
11601 flat_store_dwordx3 v[3:4], v[5:7]
11602 flat_atomic_swap v1, v[3:4], v5 glc
11603 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
11604 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
For a full list of supported instructions, refer to "FLAT instructions" in ISA
11612 .. code-block:: nasm
11614 buffer_load_dword v1, off, s[4:7], s1
11615 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
11616 buffer_store_format_xy v[1:2], off, s[4:7], s1
11618 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
For a full list of supported instructions, refer to "MUBUF Instructions" in ISA
11626 .. code-block:: nasm
11628 s_load_dword s1, s[2:3], 0xfc
11629 s_load_dwordx8 s[8:15], s[2:3], s4
11630 s_load_dwordx16 s[88:103], s[2:3], s4
For a full list of supported instructions, refer to "Scalar Memory Operations" in
11640 .. code-block:: nasm
11643 s_mov_b64 s[0:1], 0x80000000
11645 s_wqm_b64 s[2:3], s[4:5]
11646 s_bcnt0_i32_b64 s1, s[2:3]
11647 s_swappc_b64 s[2:3], s[4:5]
11648 s_cbranch_join s[4:5]
For a full list of supported instructions, refer to "SOP1 Instructions" in ISA
11656 .. code-block:: nasm
11658 s_add_u32 s1, s2, s3
11659 s_and_b64 s[2:3], s[4:5], s[6:7]
11660 s_cselect_b32 s1, s2, s3
11661 s_andn2_b32 s2, s4, s6
11662 s_lshr_b64 s[2:3], s[4:5], s6
11663 s_ashr_i32 s2, s4, s6
11664 s_bfm_b64 s[2:3], s4, s6
11665 s_bfe_i64 s[2:3], s[4:5], s6
11666 s_cbranch_g_fork s[4:5], s[6:7]
For a full list of supported instructions, refer to "SOP2 Instructions" in ISA
11674 .. code-block:: nasm
11676 s_cmp_eq_i32 s1, s2
11677 s_bitcmp1_b32 s1, s2
11678 s_bitcmp0_b64 s[2:3], s4
For a full list of supported instructions, refer to "SOPC Instructions" in ISA
11687 .. code-block:: nasm
11692 s_waitcnt 0 ; Wait for all counters to be 0
11693 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
11694 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
11698 s_sendmsg sendmsg(MSG_INTERRUPT)
For a full list of supported instructions, refer to "SOPP Instructions" in ISA
Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.
For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler automatically uses the optimal encoding based on the operands. To
force a specific encoding, add a suffix to the opcode of the instruction:
11715 * _e32 for 32-bit VOP1/VOP2/VOPC
11716 * _e64 for 64-bit VOP3
11718 * _sdwa for VOP_SDWA
11720 VOP1/VOP2/VOP3/VOPC examples:
11722 .. code-block:: nasm
11725 v_mov_b32_e32 v1, v2
11727 v_cvt_f64_i32_e32 v[1:2], v2
11728 v_floor_f32_e32 v1, v2
11729 v_bfrev_b32_e32 v1, v2
11730 v_add_f32_e32 v1, v2, v3
11731 v_mul_i32_i24_e64 v1, v2, 3
11732 v_mul_i32_i24_e32 v1, -3, v3
11733 v_mul_i32_i24_e32 v1, -100, v3
11734 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
11735 v_max_f16_e32 v1, v2, v3
11739 .. code-block:: nasm
11741 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
11742 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
11743 v_mov_b32 v0, v0 wave_shl:1
11744 v_mov_b32 v0, v0 row_mirror
11745 v_mov_b32 v0, v0 row_bcast:31
11746 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
11747 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
11748 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
11752 .. code-block:: nasm
11754 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
11755 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
11756 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
11757 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
11758 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
For a full list of supported instructions, refer to "Vector ALU instructions".
11762 .. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
11764 Code Object V2 Predefined Symbols
11765 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11768 Code object V2 is not the default code object version emitted by
11769 this version of LLVM.
11771 The AMDGPU assembler defines and updates some symbols automatically. These
11772 symbols do not affect code generation.
11774 .option.machine_version_major
11775 +++++++++++++++++++++++++++++
11777 Set to the GFX major generation number of the target being assembled for. For
11778 example, when assembling for a "GFX9" target this will be set to the integer
11779 value "9". The possible GFX major generation numbers are presented in
11780 :ref:`amdgpu-processors`.
11782 .option.machine_version_minor
11783 +++++++++++++++++++++++++++++
11785 Set to the GFX minor generation number of the target being assembled for. For
11786 example, when assembling for a "GFX810" target this will be set to the integer
11787 value "1". The possible GFX minor generation numbers are presented in
11788 :ref:`amdgpu-processors`.
11790 .option.machine_version_stepping
11791 ++++++++++++++++++++++++++++++++
11793 Set to the GFX stepping generation number of the target being assembled for.
11794 For example, when assembling for a "GFX704" target this will be set to the
11795 integer value "4". The possible GFX stepping generation numbers are presented
11796 in :ref:`amdgpu-processors`.
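
The relationship between a processor name and the three version symbols can be
sketched as below. This mirrors the examples given above (``GFX810`` has minor
version 1, ``GFX704`` has stepping 4). Treating the last two characters as
hexadecimal digits, so that a name such as ``gfx90a`` yields stepping 10, is an
assumption of this sketch, and the function name is hypothetical.

.. code-block:: python

   def machine_version(processor):
       """Split a name such as 'gfx906' into (major, minor, stepping)."""
       name = processor.lower()
       if not name.startswith("gfx") or len(name) < 6:
           raise ValueError(f"unexpected processor name: {processor!r}")
       digits = name[3:]
       major = int(digits[:-2])         # everything before the last two characters
       minor = int(digits[-2], 16)      # assumed to be a single hex digit
       stepping = int(digits[-1], 16)   # assumed to be a single hex digit
       return major, minor, stepping


   for name in ("GFX704", "GFX810", "gfx906", "gfx1011"):
       print(name, machine_version(name))
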
11801 Set to zero each time a
11802 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
11803 encountered. At each instruction, if the current value of this symbol is less
11804 than or equal to the maximum VGPR number explicitly referenced within that
11805 instruction then the symbol value is updated to equal that VGPR number plus
11811 Set to zero each time a
11812 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
11813 encountered. At each instruction, if the current value of this symbol is less
than or equal to the maximum SGPR number explicitly referenced within that
11815 instruction then the symbol value is updated to equal that SGPR number plus
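
The update rule for the two register count symbols can be simulated for a
single kernel as follows. This is an illustration rather than the assembler's
logic: the regular expression recognizes only plain ``vN``/``sN`` registers and
``v[a:b]``/``s[a:b]`` ranges, and the function name is hypothetical.

.. code-block:: python

   import re

   # Matches v3, s12, v[3:4], s[8:15]; group 1 is the register kind.
   _REGISTER = re.compile(r"\b([vs])(?:(\d+)|\[(\d+):(\d+)\])")


   def track_register_counts(instructions):
       """Return (vgpr_count, sgpr_count) after processing the instructions."""
       counts = {"v": 0, "s": 0}  # both start at zero for a new kernel
       for text in instructions:
           highest = {}
           for kind, single, _low, high in _REGISTER.findall(text):
               number = int(single) if single else int(high)
               highest[kind] = max(highest.get(kind, -1), number)
           for kind, number in highest.items():
               # Update only if the current value is <= the highest register used.
               if counts[kind] <= number:
                   counts[kind] = number + 1
       return counts["v"], counts["s"]


   print(track_register_counts([
       "v_add_f32 v1, v2, v3",
       "s_load_dwordx8 s[8:15], s[2:3], s4",
       "flat_load_dword v1, v[3:4]",
   ]))
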
11818 .. _amdgpu-amdhsa-assembler-directives-v2:
11820 Code Object V2 Directives
11821 ~~~~~~~~~~~~~~~~~~~~~~~~~
11824 Code object V2 is not the default code object version emitted by
11825 this version of LLVM.
The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.
11830 .hsa_code_object_version major, minor
11831 +++++++++++++++++++++++++++++++++++++
11833 *major* and *minor* are integers that specify the version of the HSA code
11834 object that will be generated by the assembler.
11836 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
11837 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
11840 *major*, *minor*, and *stepping* are all integers that describe the instruction
11841 set architecture (ISA) version of the assembly program.
11843 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
11844 "AMD" and *arch* should always be equal to "AMDGPU".
11846 By default, the assembler will derive the ISA version, *vendor*, and *arch*
11847 from the value of the -mcpu option that is passed to the assembler.
11849 .. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
11851 .amdgpu_hsa_kernel (name)
11852 +++++++++++++++++++++++++
This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t
++++++++++++++++++

This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the *.end_amd_kernel_code_t* directive. Any
amd_kernel_code_t value that is left unspecified takes a default value. The
default value for all keys is 0, with the following exceptions:

11867 - *amd_code_version_major* defaults to 1.
11868 - *amd_kernel_code_version_minor* defaults to 2.
11869 - *amd_machine_kind* defaults to 1.
11870 - *amd_machine_version_major*, *machine_version_minor*, and
11871 *amd_machine_version_stepping* are derived from the value of the -mcpu option
11872 that is passed to the assembler.
11873 - *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6 for all targets before GFX10. For GFX10
  onwards it defaults to 6 if target feature ``wavefrontsize64`` is enabled,
  otherwise 5. Note that wavefront size is specified as a power of two, so a
  value of **n** means a size of 2^ **n**.
11878 - *call_convention* defaults to -1.
11879 - *kernarg_segment_alignment*, *group_segment_alignment*, and
11880 *private_segment_alignment* default to 4. Note that alignments are specified
11881 as a power of 2, so a value of **n** means an alignment of 2^ **n**.
11882 - *enable_tg_split* defaults to 1 if target feature ``tgsplit`` is enabled for
11884 - *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
11886 - *enable_mem_ordered* defaults to 1 for GFX10 onwards.
11888 The *.amd_kernel_code_t* directive must be placed immediately after the
11889 function label and before any instructions.
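
The power-of-two encoding used by several of these keys can be illustrated with
a short fragment. This is a sketch only: the label ``my_kernel`` is a
hypothetical name, and the values simply restate the pre-GFX10 defaults listed
above, with the alignments read as byte counts.

.. code::

   my_kernel:                             // hypothetical kernel label
      .amd_kernel_code_t
         wavefront_size = 6               // 2^6 = 64 work-items per wavefront
         kernarg_segment_alignment = 4    // 2^4 = 16 (bytes)
         group_segment_alignment = 4      // 2^4 = 16 (bytes)
         private_segment_alignment = 4    // 2^4 = 16 (bytes)
      .end_amd_kernel_code_t
      // kernel instructions follow the directive
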
For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and
test/CodeGen/AMDGPU/hsa.s.
11894 .. _amdgpu-amdhsa-assembler-example-v2:
11896 Code Object V2 Example Source Code
11897 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11900 Code Object V2 is not the default code object version emitted by
11901 this version of LLVM.
Here is an example of a minimal assembly source file, defining one HSA kernel:

.. code::

   .hsa_code_object_version 1,0
   .hsa_code_object_isa

   .text
   .globl  hello_world
   .p2align 8
   .amdgpu_hsa_kernel hello_world

   hello_world:

      .amd_kernel_code_t
         enable_sgpr_kernarg_segment_ptr = 1
         compute_pgm_rsrc1_vgprs = 0
         compute_pgm_rsrc1_sgprs = 0
         compute_pgm_rsrc2_user_sgpr = 2
         compute_pgm_rsrc1_wgp_mode = 0
         compute_pgm_rsrc1_mem_ordered = 0
         compute_pgm_rsrc1_fwd_progress = 1
      .end_amd_kernel_code_t

      s_load_dwordx2 s[0:1], s[0:1] 0x0   // load the pointer argument
      v_mov_b32 v0, 3.14159
      s_waitcnt lgkmcnt(0)                // wait for the scalar load
      v_mov_b32 v1, s0                    // copy the address into v[1:2]
      v_mov_b32 v2, s1
      flat_store_dword v[1:2], v0
      s_endpgm
   .Lfunc_end0:
      .size   hello_world, .Lfunc_end0-hello_world

11939 .. _amdgpu-amdhsa-assembler-predefined-symbols-v3-v4:
11941 Code Object V3 to V4 Predefined Symbols
11942 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
11944 The AMDGPU assembler defines and updates some symbols automatically. These
11945 symbols do not affect code generation.
11947 .amdgcn.gfx_generation_number
11948 +++++++++++++++++++++++++++++
11950 Set to the GFX major generation number of the target being assembled for. For
11951 example, when assembling for a "GFX9" target this will be set to the integer
11952 value "9". The possible GFX major generation numbers are presented in
11953 :ref:`amdgpu-processors`.
11955 .amdgcn.gfx_generation_minor
11956 ++++++++++++++++++++++++++++
11958 Set to the GFX minor generation number of the target being assembled for. For
11959 example, when assembling for a "GFX810" target this will be set to the integer
11960 value "1". The possible GFX minor generation numbers are presented in
11961 :ref:`amdgpu-processors`.
11963 .amdgcn.gfx_generation_stepping
11964 +++++++++++++++++++++++++++++++
11966 Set to the GFX stepping generation number of the target being assembled for.
11967 For example, when assembling for a "GFX704" target this will be set to the
11968 integer value "4". The possible GFX stepping generation numbers are presented
11969 in :ref:`amdgpu-processors`.
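
Because these symbols hold plain integer values, they can in principle be used
wherever the assembler accepts an absolute expression. The fragment below is a
hedged sketch: it assumes that the standard ``.if``/``.else``/``.endif``
conditional assembly directives can test ``.amdgcn.gfx_generation_number``, and
the ``s_nop`` instructions are placeholders chosen only for illustration.

.. code::

   .if .amdgcn.gfx_generation_number >= 10
     s_nop 0    // placeholder for a GFX10+ specific code path
   .else
     s_nop 1    // placeholder for the path used on earlier generations
   .endif
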
11971 .. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
11973 .amdgcn.next_free_vgpr
11974 ++++++++++++++++++++++
11976 Set to zero before assembly begins. At each instruction, if the current value
11977 of this symbol is less than or equal to the maximum VGPR number explicitly
11978 referenced within that instruction then the symbol value is updated to equal
11979 that VGPR number plus one.
May be used to set the ``.amdhsa_next_free_vgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.
11984 May be set at any time, e.g. manually set to zero at the start of each kernel.
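
The update rule can be traced by hand on a short instruction sequence. The
instructions below are arbitrary and chosen only for illustration. The comments
show the value the symbol would hold after each line if the rule above is
applied.

.. code::

   .set .amdgcn.next_free_vgpr, 0   // manually reset: value is 0
   v_mov_b32 v0, 1.0                // highest VGPR is v0; 0 <= 0, so value becomes 1
   v_add_f32 v5, v0, v0             // highest VGPR is v5; 1 <= 5, so value becomes 6
   v_mov_b32 v2, v5                 // highest VGPR is v5; 6 > 5, so value stays 6
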
11986 .. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
11988 .amdgcn.next_free_sgpr
11989 ++++++++++++++++++++++
Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that SGPR number plus one.

May be used to set the ``.amdhsa_next_free_sgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.
11999 May be set at any time, e.g. manually set to zero at the start of each kernel.
12001 .. _amdgpu-amdhsa-assembler-directives-v3-v4:
12003 Code Object V3 to V4 Directives
12004 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
12006 Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
12007 architecture processors, and are not OS-specific. Directives which begin with
12008 ``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
12009 ``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
12010 :ref:`amdgpu-processors`.
12012 .. _amdgpu-assembler-directive-amdgcn-target:
12014 .amdgcn_target <target-triple> "-" <target-id>
12015 ++++++++++++++++++++++++++++++++++++++++++++++
12017 Optional directive which declares the ``<target-triple>-<target-id>`` supported
12018 by the containing assembler source file. Used by the assembler to validate
12019 command-line options such as ``-triple``, ``-mcpu``, and
12020 ``--offload-arch=<target-id>``. A non-canonical target ID is allowed. See
12021 :ref:`amdgpu-target-triples` and :ref:`amdgpu-target-id`.
12025 The target ID syntax used for code object V2 to V3 for this directive differs
12026 from that used elsewhere. See :ref:`amdgpu-target-id-v2-v3`.
12028 .amdhsa_kernel <name>
12029 +++++++++++++++++++++
12031 Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
12032 ``<name>.kd``, in the current location of the current section. Only valid when
12033 the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
12034 instruction to execute, and does not need to be previously defined.
12036 Marks the beginning of a list of directives used to generate the bytes of a
12037 kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
12038 Directives which may appear in this list are described in
12039 :ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
12040 be valid for the target being assembled for, and cannot be repeated. Directives
12041 support the range of values specified by the field they reference in
12042 :ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
12043 assumed to have its default value, unless it is marked as "Required", in which
12044 case it is an error to omit the directive. This list of directives is
12045 terminated by an ``.end_amdhsa_kernel`` directive.
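
A minimal sketch of such a list follows. The kernel name ``my_kernel`` and the
register counts are hypothetical. Only the two directives that
:ref:`amdhsa-kernel-directives-table` marks as "Required" for GFX6-GFX10 are
given, so every other field takes its default value. For GFX90A the table also
marks ``.amdhsa_accum_offset`` as required.

.. code::

   .amdhsa_kernel my_kernel            // hypothetical kernel symbol
     .amdhsa_next_free_vgpr 4          // assumes v0-v3 are the VGPRs used
     .amdhsa_next_free_sgpr 8          // assumes s0-s7 are the SGPRs used
   .end_amdhsa_kernel
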
.. table:: AMDHSA Kernel Assembler Directives
     :name: amdhsa-kernel-directives-table

     ======================================================== ======================== ============== ============================================================
     Directive                                                Default                  Supported On   Description
     ======================================================== ======================== ============== ============================================================
     ``.amdhsa_group_segment_fixed_size``                     0                        GFX6-GFX10     Controls GROUP_SEGMENT_FIXED_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_private_segment_fixed_size``                   0                        GFX6-GFX10     Controls PRIVATE_SEGMENT_FIXED_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_kernarg_size``                                 0                        GFX6-GFX10     Controls KERNARG_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_private_segment_buffer``             0                        GFX6-GFX10     Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_dispatch_ptr``                       0                        GFX6-GFX10     Controls ENABLE_SGPR_DISPATCH_PTR in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_queue_ptr``                          0                        GFX6-GFX10     Controls ENABLE_SGPR_QUEUE_PTR in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                        GFX6-GFX10     Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_dispatch_id``                        0                        GFX6-GFX10     Controls ENABLE_SGPR_DISPATCH_ID in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_flat_scratch_init``                  0                        GFX6-GFX10     Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_user_sgpr_private_segment_size``               0                        GFX6-GFX10     Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_wavefront_size32``                             Target Feature           GFX10          Controls ENABLE_WAVEFRONT_SIZE32 in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                        GFX6-GFX10     Controls ENABLE_PRIVATE_SEGMENT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_id_x``                   1                        GFX6-GFX10     Controls ENABLE_SGPR_WORKGROUP_ID_X in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_id_y``                   0                        GFX6-GFX10     Controls ENABLE_SGPR_WORKGROUP_ID_Y in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_id_z``                   0                        GFX6-GFX10     Controls ENABLE_SGPR_WORKGROUP_ID_Z in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_sgpr_workgroup_info``                   0                        GFX6-GFX10     Controls ENABLE_SGPR_WORKGROUP_INFO in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_system_vgpr_workitem_id``                      0                        GFX6-GFX10     Controls ENABLE_VGPR_WORKITEM_ID in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
     ``.amdhsa_next_free_vgpr``                               Required                 GFX6-GFX10     Maximum VGPR number explicitly referenced, plus one. Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_next_free_sgpr``                               Required                 GFX6-GFX10     Maximum SGPR number explicitly referenced, plus one. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_accum_offset``                                 Required                 GFX90A         Offset of the first AccVGPR in the unified register file. Used to calculate ACCUM_OFFSET in :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
     ``.amdhsa_reserve_vcc``                                  1                        GFX6-GFX10     Whether the kernel may use the special VCC SGPR. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_reserve_flat_scratch``                         1                        GFX7-GFX10     Whether the kernel may use flat instructions to access scratch memory. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_reserve_xnack_mask``                           Target Feature Specific  GFX8-GFX10     Whether the kernel may trigger XNACK replay. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_float_round_mode_32``                          0                        GFX6-GFX10     Controls FLOAT_ROUND_MODE_32 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
     ``.amdhsa_float_round_mode_16_64``                       0                        GFX6-GFX10     Controls FLOAT_ROUND_MODE_16_64 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
     ``.amdhsa_float_denorm_mode_32``                         0                        GFX6-GFX10     Controls FLOAT_DENORM_MODE_32 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
     ``.amdhsa_float_denorm_mode_16_64``                      3                        GFX6-GFX10     Controls FLOAT_DENORM_MODE_16_64 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
     ``.amdhsa_dx10_clamp``                                   1                        GFX6-GFX10     Controls ENABLE_DX10_CLAMP in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_ieee_mode``                                    1                        GFX6-GFX10     Controls ENABLE_IEEE_MODE in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_fp16_overflow``                                0                        GFX9-GFX10     Controls FP16_OVFL in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_tg_split``                                     Target Feature           GFX90A         Controls TG_SPLIT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx90a-table`.
     ``.amdhsa_workgroup_processor_mode``                     Target Feature           GFX10          Controls ENABLE_WGP_MODE in :ref:`amdgpu-amdhsa-kernel-descriptor-v3-table`.
     ``.amdhsa_memory_ordered``                               1                        GFX10          Controls MEM_ORDERED in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_forward_progress``                             0                        GFX10          Controls FWD_PROGRESS in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_invalid_op``                 0                        GFX6-GFX10     Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_denorm_src``                      0                        GFX6-GFX10     Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_div_zero``                   0                        GFX6-GFX10     Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_overflow``                   0                        GFX6-GFX10     Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_underflow``                  0                        GFX6-GFX10     Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_fp_ieee_inexact``                    0                        GFX6-GFX10     Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ``.amdhsa_exception_int_div_zero``                       0                        GFX6-GFX10     Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
     ======================================================== ======================== ============== ============================================================


.amdgpu_metadata
++++++++++++++++

Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
note record (see :ref:`amdgpu-elf-note-records-table-v3-v4`).
12167 The contents must be in the [YAML]_ markup format, with the same structure and
12168 semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3` or
12169 :ref:`amdgpu-amdhsa-code-object-metadata-v4`.
12171 This directive is terminated by an ``.end_amdgpu_metadata`` directive.
12173 .. _amdgpu-amdhsa-assembler-example-v3-v4:
12175 Code Object V3 to V4 Example Source Code
12176 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here is an example of a minimal assembly source file, defining one HSA kernel:

.. code::

   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

   .text
   .globl hello_world
   .p2align 8
   .type hello_world,@function
   hello_world:
     s_load_dwordx2 s[0:1], s[0:1] 0x0
     v_mov_b32 v0, 3.14159
     s_waitcnt lgkmcnt(0)
     v_mov_b32 v1, s0
     v_mov_b32 v2, s1
     flat_store_dword v[1:2], v0
     s_endpgm
   .Lfunc_end0:
     .size   hello_world, .Lfunc_end0-hello_world

   .rodata
   .p2align 6
   .amdhsa_kernel hello_world
     .amdhsa_user_sgpr_kernarg_segment_ptr 1
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

   .amdgpu_metadata
   ---
   amdhsa.version:
     - 1
     - 0
   amdhsa.kernels:
     - .name: hello_world
       .symbol: hello_world.kd
       .kernarg_segment_size: 48
       .group_segment_fixed_size: 0
       .private_segment_fixed_size: 0
       .kernarg_segment_align: 4
       .wavefront_size: 64
       .sgpr_count: 2
       .vgpr_count: 3
       .max_flat_workgroup_size: 256
       .args:
         - .size: 8
           .offset: 0
           .value_kind: global_buffer
           .address_space: global
           .actual_access: write_only
   .end_amdgpu_metadata

This kernel is equivalent to the following HIP program:

.. code::

   __global__ void hello_world(float *p) {
     *p = 3.14159f;
   }

If an assembly source file contains multiple kernels and/or functions, the
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
the ``.set <symbol>, <expression>`` directive. For example, in the case of two
kernels, where ``func1`` is only called from ``kern1``, it is sufficient
to group the function with the kernel that calls it and reset the symbols
between the two connected components:

.. code::

   .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

   // gpr tracking symbols are implicitly set to zero

   .text
   .globl kern0
   .p2align 8
   .type kern0,@function
   kern0:                       // kernel body elided
     s_endpgm
   .Lkern0_end:
     .size   kern0, .Lkern0_end-kern0

   .rodata
   .p2align 6
   .amdhsa_kernel kern0
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

   // reset symbols to begin tracking usage in func1 and kern1
   .set .amdgcn.next_free_vgpr, 0
   .set .amdgcn.next_free_sgpr, 0

   .text
   .type func1,@function
   func1:                       // function body elided
     s_setpc_b64 s[30:31]
   .Lfunc1_end:
     .size func1, .Lfunc1_end-func1

   .globl kern1
   .p2align 8
   .type kern1,@function
   kern1:
     s_getpc_b64 s[4:5]
     s_add_u32 s4, s4, func1@rel32@lo+4
     s_addc_u32 s5, s5, func1@rel32@hi+12
     s_swappc_b64 s[30:31], s[4:5]
     s_endpgm
   .Lkern1_end:
     .size   kern1, .Lkern1_end-kern1

   .rodata
   .p2align 6
   .amdhsa_kernel kern1
     .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
     .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
   .end_amdhsa_kernel

12312 These symbols cannot identify connected components in order to automatically
12313 track the usage for each kernel. However, in some cases careful organization of
12314 the kernels and functions in the source file means there is minimal additional
12315 effort required to accurately calculate GPR usage.
12317 Additional Documentation
12318 ========================
12320 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
12322 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
12323 .. [AMD-GCN-GFX900-GFX904-VEGA] `AMD Vega Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
12324 .. [AMD-GCN-GFX906-VEGA7NM] `AMD Vega 7nm Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/11/Vega_7nm_Shader_ISA_26November2019.pdf>`__
12325 .. [AMD-GCN-GFX908-CDNA1] `AMD Instinct MI100 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf>`__
12326 .. [AMD-GCN-GFX10-RDNA1] `AMD RDNA 1.0 Instruction Set Architecture <https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf>`__
12327 .. [AMD-GCN-GFX10-RDNA2] `AMD RDNA 2 Instruction Set Architecture <https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf>`__
12328 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
12329 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
12330 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
12331 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
12332 .. [AMD-ROCm] `AMD ROCm™ Platform <https://rocmdocs.amd.com/>`__
12333 .. [AMD-ROCm-github] `AMD ROCm™ github <http://github.com/RadeonOpenCompute>`__
12334 .. [AMD-ROCm-Release-Notes] `AMD ROCm Release Notes <https://github.com/RadeonOpenCompute/ROCm>`__
12335 .. [CLANG-ATTR] `Attributes in Clang <https://clang.llvm.org/docs/AttributeReference.html>`__
12336 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
12337 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
12338 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
12339 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
12340 .. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
12341 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
12342 .. [SEMVER] `Semantic Versioning <https://semver.org/>`__
12343 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__