1 =============================
2 User Guide for AMDGPU Backend
3 =============================
11 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12 R600 family up until the current GCN families. It lives in the
13 ``lib/Target/AMDGPU`` directory.
18 .. _amdgpu-target-triples:
23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24 specify the target triple:
26 .. table:: AMDGPU Target Triples
27 :name: amdgpu-target-triples-table
29 ============ ======== ========= ===========
30 Architecture Vendor OS Environment
31 ============ ======== ========= ===========
32 r600 amd <empty> <empty>
33 amdgcn amd <empty> <empty>
34 amdgcn amd amdhsa <empty>
35 amdgcn amd amdhsa opencl
36 amdgcn amd amdhsa amdgizcl
37 amdgcn amd amdhsa amdgiz
39 ============ ======== ========= ===========
42 Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
46 Supports AMD GPUs GCN GFX6 onwards for graphics and compute shaders executed on
49 ``amdgcn-amd-amdhsa-``
50 Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
51 compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
53 ``amdgcn-amd-amdhsa-opencl``
54 Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
55 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
58 ``amdgcn-amd-amdhsa-amdgizcl``
59 Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
60 is used (see :ref:`amdgpu-address-spaces`).
62 ``amdgcn-amd-amdhsa-amdgiz``
63 Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
64 used (see :ref:`amdgpu-address-spaces`).
66 ``amdgcn-amd-amdhsa-hcc``
67 Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
68 executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
71 .. _amdgpu-processors:
76 Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
77 names from both the *Processor* and *Alternative Processor* can be used.
79 .. table:: AMDGPU Processors
80 :name: amdgpu-processors-table
82 ========== =========== ============ ===== ======= ==================
83 Processor Alternative Target dGPU/ Runtime Example
84 Processor Triple APU Support Products
86 ========== =========== ============ ===== ======= ==================
88 --------------------------------------------------------------------
94 --------------------------------------------------------------------
98 **Evergreen** [AMD-Evergreen]_
99 --------------------------------------------------------------------
105 **Northern Islands** [AMD-Cayman-Trinity]_
106 --------------------------------------------------------------------
111 **GCN GFX6 (Southern Islands (SI))** [AMD-Souther-Islands]_
112 --------------------------------------------------------------------
113 gfx600 - tahiti amdgcn dGPU
114 gfx601 - pitcairn amdgcn dGPU
118 **GCN GFX7 (Sea Islands (CI))** [AMD-Sea-Islands]_
119 --------------------------------------------------------------------
120 gfx700 - bonaire amdgcn dGPU - Radeon HD 7790
124 \ - kaveri amdgcn APU - A6-7000
134 gfx701 - hawaii amdgcn dGPU ROCm - FirePro W8100
138 gfx702 dGPU ROCm - Radeon R9 290
142 gfx703 - kabini amdgcn APU - E1-2100
151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-Volcanic-Islands]_
152 --------------------------------------------------------------------
153 gfx800 - iceland amdgcn dGPU - FirePro S7150
161 gfx801 - carrizo amdgcn APU - A6-8500P
167 \ amdgcn APU ROCm - A10-8700P
170 \ amdgcn APU - A10-9600P
176 \ amdgcn APU - E2-9010
179 gfx802 - tonga amdgcn dGPU ROCm Same as gfx800
180 gfx803 - fiji amdgcn dGPU ROCm - Radeon R9 Nano
185 - Radeon Instinct MI8
186 \ - polaris10 amdgcn dGPU ROCm - Radeon RX 470
188 - Radeon Instinct MI6
189 \ - polaris11 amdgcn dGPU ROCm - Radeon RX 460
190 gfx804 amdgcn dGPU Same as gfx803
191 gfx810 - stoney amdgcn APU
192 **GCN GFX9** [AMD-Vega]_
193 --------------------------------------------------------------------
194 gfx900 amdgcn dGPU - Radeon Vega
200 - Radeon Instinct MI25
201 gfx901 amdgcn dGPU ROCm Same as gfx900
204 gfx902 amdgcn APU *TBA*
209 gfx903 amdgcn APU Same as gfx902
212 ========== =========== ============ ===== ======= ==================
214 .. _amdgpu-address-spaces:
219 The AMDGPU backend uses the following address space mappings.
221 The memory space names used in the table, aside from the region memory space, is
222 from the OpenCL standard.
224 LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
226 .. table:: Address Space Mapping
227 :name: amdgpu-address-space-mapping-table
229 ================== ================= ================= ================= =================
230 LLVM Address Space Memory Space
231 ------------------ -----------------------------------------------------------------------
232 \ Current Default amdgiz/amdgizcl hcc Future Default
233 ================== ================= ================= ================= =================
234 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
235 1 Global Global Global Global
236 2 Constant Constant Constant Region (GDS)
237 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
238 4 Generic (Flat) Region (GDS) Region (GDS) Constant
239 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
240 ================== ================= ================= ================= =================
243 This is the current default address space mapping used for all languages
244 except hcc. This will shortly be deprecated.
247 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
248 is specified as the target triple environment value.
251 This is the current address space mapping used when ``hcc`` is specified as
252 the target triple environment value.This will shortly be deprecated.
255 This will shortly be the only address space mapping for all languages using
258 .. _amdgpu-memory-scopes:
263 This section provides LLVM memory synchronization scopes supported by the AMDGPU
264 backend memory model when the target triple OS is ``amdhsa`` (see
265 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
267 The memory model supported is based on the HSA memory model [HSA]_ which is
268 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
269 relation is transitive over the synchonizes-with relation independent of scope,
270 and synchonizes-with allows the memory scope instances to be inclusive (see
271 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
273 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
274 inclusion and requires the memory scopes to exactly match. However, this
275 is conservatively correct for OpenCL.
277 .. table:: AMDHSA LLVM Sync Scopes for AMDHSA
278 :name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
280 ================ ==========================================================
281 LLVM Sync Scope Description
282 ================ ==========================================================
283 *none* The default: ``system``.
285 Synchronizes with, and participates in modification and
286 seq_cst total orderings with, other operations (except
287 image operations) for all address spaces (except private,
288 or generic that accesses private) provided the other
289 operation's sync scope is:
292 - ``agent`` and executed by a thread on the same agent.
293 - ``workgroup`` and executed by a thread in the same
295 - ``wavefront`` and executed by a thread in the same
298 ``agent`` Synchronizes with, and participates in modification and
299 seq_cst total orderings with, other operations (except
300 image operations) for all address spaces (except private,
301 or generic that accesses private) provided the other
302 operation's sync scope is:
304 - ``system`` or ``agent`` and executed by a thread on the
306 - ``workgroup`` and executed by a thread in the same
308 - ``wavefront`` and executed by a thread in the same
311 ``workgroup`` Synchronizes with, and participates in modification and
312 seq_cst total orderings with, other operations (except
313 image operations) for all address spaces (except private,
314 or generic that accesses private) provided the other
315 operation's sync scope is:
317 - ``system``, ``agent`` or ``workgroup`` and executed by a
318 thread in the same workgroup.
319 - ``wavefront`` and executed by a thread in the same
322 ``wavefront`` Synchronizes with, and participates in modification and
323 seq_cst total orderings with, other operations (except
324 image operations) for all address spaces (except private,
325 or generic that accesses private) provided the other
326 operation's sync scope is:
328 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
329 and executed by a thread in the same wavefront.
331 ``singlethread`` Only synchronizes with, and participates in modification
332 and seq_cst total orderings with, other operations (except
333 image operations) running in the same thread for all
334 address spaces (for example, in signal handlers).
335 ================ ==========================================================
340 The AMDGPU backend implements the following intrinsics.
342 *This section is WIP.*
345 List AMDGPU intrinsics
350 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
351 can be linked by ``lld`` to produce a standard ELF shared code object which can
352 be loaded and executed on an AMDGPU target.
357 The AMDGPU backend uses the following ELF header:
359 .. table:: AMDGPU ELF Header
360 :name: amdgpu-elf-header-table
362 ========================== =========================
364 ========================== =========================
365 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
366 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
367 ``e_ident[EI_OSABI]`` ``ELFOSABI_AMDGPU_HSA``
368 ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``
369 ``e_type`` ``ET_REL`` or ``ET_DYN``
370 ``e_machine`` ``EM_AMDGPU``
373 ========================== =========================
377 .. table:: AMDGPU ELF Header Enumeration Values
378 :name: amdgpu-elf-header-enumeration-values-table
380 ============================ =====
382 ============================ =====
384 ``ELFOSABI_AMDGPU_HSA`` 64
385 ``ELFABIVERSION_AMDGPU_HSA`` 1
386 ============================ =====
388 ``e_ident[EI_CLASS]``
389 The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64 bit
393 All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
395 ``e_ident[EI_OSABI]``
396 The AMD GPU architecture specific OS ABI of ``ELFOSABI_AMDGPU_HSA`` is used to
397 specify that the code object conforms to the AMD HSA runtime ABI [HSA]_.
399 ``e_ident[EI_ABIVERSION]``
400 The AMD GPU architecture specific OS ABI version of
401 ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA runtime
402 ABI to which the code object conforms.
405 Can be one of the following values:
409 The type produced by the AMD GPU backend compiler as it is relocatable code
413 The type produced by the linker as it is a shared code object.
415 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
418 The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
419 architecture family. The specific member is specified in the
420 ``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
421 :ref:`amdgpu-note-records`).
424 The entry point is 0 as the entry points for individual kernels must be
425 selected in order to invoke them through AQL packets.
428 The value is 0 as no flags are used.
433 An AMDGPU target ELF code object has the standard ELF sections which include:
435 .. table:: AMDGPU ELF Sections
436 :name: amdgpu-elf-sections-table
438 ================== ================ =================================
440 ================== ================ =================================
441 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
442 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
443 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
444 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
445 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
446 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
447 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
448 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
449 ``.note`` ``SHT_NOTE`` *none*
450 ``.rela``\ *name* ``SHT_RELA`` *none*
451 ``.rela.dyn`` ``SHT_RELA`` *none*
452 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
453 ``.shstrtab`` ``SHT_STRTAB`` *none*
454 ``.strtab`` ``SHT_STRTAB`` *none*
455 ``.symtab`` ``SHT_SYMTAB`` *none*
456 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
457 ================== ================ =================================
459 These sections have their standard meanings (see [ELF]_) and are only generated
463 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
464 DWARF produced by the AMDGPU backend.
466 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
467 The standard sections used by a dynamic loader.
470 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
473 ``.rela``\ *name*, ``.rela.dyn``
474 For relocatable code objects, *name* is the name of the section that the
475 relocation records apply. For example, ``.rela.text`` is the section name for
476 relocation records associated with the ``.text`` section.
478 For linked shared code objects, ``.rela.dyn`` contains all the relocation
479 records from each of the relocatable code object's ``.rela``\ *name* sections.
481 See :ref:`amdgpu-relocation-records` for the relocation records supported by
485 The executable machine code for the kernels and functions they call. Generated
486 as position independent code. See :ref:`amdgpu-code-conventions` for
487 information on conventions used in the isa generation.
489 .. _amdgpu-note-records:
494 As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
495 the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
496 minimal zero byte padding must be generated to ensure the ``desc`` field size is
497 a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
498 be at least 4 to indicate at least 8 byte alignment.
500 The AMDGPU backend code object uses the following ELF note records in the
501 ``.note`` section. The *Description* column specifies the layout of the note
502 record’s ``desc`` field. All fields are consecutive bytes. Note records with
503 variable size strings have a corresponding ``*_size`` field that specifies the
504 number of bytes, including the terminating null character, in the string. The
505 string(s) come immediately after the preceding fields.
507 Additional note records can be present.
509 .. table:: AMDGPU ELF Note Records
510 :name: amdgpu-elf-note-records-table
512 ===== ============================== ======================================
513 Name Type Description
514 ===== ============================== ======================================
515 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
516 "AMD" ``NT_AMD_AMDGPU_ISA`` <isa name null terminated string>
517 ===== ============================== ======================================
521 .. table:: AMDGPU ELF Note Record Enumeration Values
522 :name: amdgpu-elf-note-record-enumeration-values-table
524 ============================== =====
526 ============================== =====
528 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
529 ``NT_AMD_AMDGPU_ISA`` 11
530 ============================== =====
532 ``NT_AMD_AMDGPU_ISA``
533 Specifies the instruction set architecture used by the machine code contained
536 This note record is required for code objects containing machine code for
537 processors matching the ``amdgcn`` architecture in table
538 :ref:`amdgpu-processors`.
540 The null terminated string has the following syntax:
542 *architecture*\ ``-``\ *vendor*\ ``-``\ *os*\ ``-``\ *environment*\ ``-``\ *processor*
547 The architecture from table :ref:`amdgpu-target-triples-table`.
549 This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
550 :ref:`amdgpu-target-triples`).
553 The vendor from table :ref:`amdgpu-target-triples-table`.
555 For the AMDGPU backend this is always ``amd``.
558 The OS from table :ref:`amdgpu-target-triples-table`.
561 An environment from table :ref:`amdgpu-target-triples-table`, or blank if
562 the environment has no affect on the execution of the code object.
564 For the AMDGPU backend this is currently always blank.
566 The processor from table :ref:`amdgpu-processors-table`.
570 ``amdgcn-amd-amdhsa--gfx901``
572 ``NT_AMD_AMDGPU_HSA_METADATA``
573 Specifies extensible metadata associated with the code objects executed on HSA
574 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
575 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
576 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
577 object metadata string.
584 Symbols include the following:
586 .. table:: AMDGPU ELF Symbols
587 :name: amdgpu-elf-symbols-table
589 ===================== ============== ============= ==================
590 Name Type Section Description
591 ===================== ============== ============= ==================
592 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
595 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
596 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
597 ===================== ============== ============= ==================
600 Global variables both used and defined by the compilation unit.
602 If the symbol is defined in the compilation unit then it is allocated in the
603 appropriate section according to if it has initialized data or is readonly.
605 If the symbol is external then its section is ``STN_UNDEF`` and the loader
606 will resolve relocations using the definition provided by another code object
607 or explicitly defined by the runtime.
609 All global symbols, whether defined in the compilation unit or external, are
610 accessed by the machine code indirectly through a GOT table entry. This
611 allows them to be preemptable. The GOT table is only supported when the target
612 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
615 Add description of linked shared object symbols. Seems undefined symbols
616 are marked as STT_NOTYPE.
619 Every HSA kernel has an associated kernel descriptor. It is the address of the
620 kernel descriptor that is used in the AQL dispatch packet used to invoke the
621 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
622 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
625 Every HSA kernel also has a symbol for its machine code entry point.
627 .. _amdgpu-relocation-records:
632 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
633 relocatable fields are:
636 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
637 alignment. These values use the same byte order as other word values in the
638 AMD GPU architecture.
641 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
642 alignment. These values use the same byte order as other word values in the
643 AMD GPU architecture.
645 Following notations are used for specifying relocation calculations:
648 Represents the addend used to compute the value of the relocatable field.
651 Represents the offset into the global offset table at which the relocation
652 entry’s symbol will reside during execution.
655 Represents the address of the global offset table.
658 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
659 of the storage unit being relocated (computed using ``r_offset``).
662 Represents the value of the symbol whose index resides in the relocation
665 The following relocation types are supported:
667 .. table:: AMDGPU ELF Relocation Records
668 :name: amdgpu-elf-relocation-records-table
670 ========================== ===== ========== ==============================
671 Relocation Type Value Field Calculation
672 ========================== ===== ========== ==============================
673 ``R_AMDGPU_NONE`` 0 *none* *none*
674 ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
675 ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
676 ``R_AMDGPU_ABS64`` 3 ``word64`` S + A
677 ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
678 ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
679 ``R_AMDGPU_ABS32`` 6 ``word32`` S + A
680 ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
681 ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
682 ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
683 ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
684 ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
685 ========================== ===== ========== ==============================
692 Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
693 information that maps the code object executable code and data to the source
694 language constructs. It can be used by tools such as debuggers and profilers.
696 Address Space Mapping
697 ~~~~~~~~~~~~~~~~~~~~~
699 The following address space mapping is used:
701 .. table:: AMDGPU DWARF Address Space Mapping
702 :name: amdgpu-dwarf-address-space-mapping-table
704 =================== =================
705 DWARF Address Space Memory Space
706 =================== =================
711 *omitted* Generic (Flat)
712 *not supported* Region (GDS)
713 =================== =================
715 See :ref:`amdgpu-address-spaces` for information on the memory space terminology
718 An ``address_class`` attribute is generated on pointer type DIEs to specify the
719 DWARF address space of the value of the pointer when it is in the *private* or
720 *local* address space. Otherwise the attribute is omitted.
722 An ``XDEREF`` operation is generated in location list expressions for variables
723 that are allocated in the *private* and *local* address space. Otherwise no
724 ``XDREF`` is omitted.
729 *This section is WIP.*
732 Define DWARF register enumeration.
734 If want to present a wavefront state then should expose vector registers as
735 64 wide (rather than per work-item view that LLVM uses). Either as separate
736 registers, or a 64x4 byte single register. In either case use a new LANE op
737 (akin to XDREF) to select the current lane usage in a location
738 expression. This would also allow scalar register spilling to vector register
739 lanes to be expressed (currently no debug information is being generated for
740 spilling). If choose a wide single register approach then use LANE in
741 conjunction with PIECE operation to select the dword part of the register for
742 the current lane. If the separate register approach then use LANE to select
748 *This section is WIP.*
751 DWARF extension to include runtime generated source text.
753 .. _amdgpu-code-conventions:
758 This section provides code conventions used for each supported target triple OS
759 (see :ref:`amdgpu-target-triples`).
764 This section provides code conventions used when the target triple OS is
765 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
767 .. _amdgpu-amdhsa-hsa-code-object-metadata:
772 The code object metadata specifies extensible metadata associated with the code
773 objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
774 [AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
775 (see :ref:`amdgpu-note-records`) and is required when the target triple OS is
776 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
777 information necessary to support the ROCM kernel queries. For example, the
778 segment sizes needed in a dispatch packet. In addition, a high level language
779 runtime may require other information to be included. For example, the AMD
780 OpenCL runtime records kernel argument information.
782 The metadata is specified as a YAML formatted string (see [YAML]_ and
786 Is the string null terminated? It probably should not if YAML allows it to
787 contain null characters, otherwise it should be.
789 The metadata is represented as a single YAML document comprised of the mapping
790 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
793 For boolean values, the string values of ``false`` and ``true`` are used for
794 false and true respectively.
796 Additional information can be added to the mappings. To avoid conflicts, any
797 non-AMD key names should be prefixed by "*vendor-name*.".
799 .. table:: AMDHSA Code Object Metadata Mapping
800 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
802 ========== ============== ========= =======================================
803 String Key Value Type Required? Description
804 ========== ============== ========= =======================================
805 "Version" sequence of Required - The first integer is the major
806 2 integers version. Currently 1.
807 - The second integer is the minor
808 version. Currently 0.
809 "Printf" sequence of Each string is encoded information
810 strings about a printf function call. The
811 encoded information is organized as
812 fields separated by colon (':'):
814 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
819 A 32 bit integer as a unique id for
820 each printf function call
823 A 32 bit integer equal to the number
824 of arguments of printf function call
827 ``S[i]`` (where i = 0, 1, ... , N-1)
828 32 bit integers for the size in bytes
829 of the i-th FormatString argument of
830 the printf function call
833 The format string passed to the
834 printf function call.
835 "Kernels" sequence of Required Sequence of the mappings for each
836 mapping kernel in the code object. See
837 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
838 for the definition of the mapping.
839 ========== ============== ========= =======================================
843 .. table:: AMDHSA Code Object Kernel Metadata Mapping
844 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
846 ================= ============== ========= ================================
847 String Key Value Type Required? Description
848 ================= ============== ========= ================================
849 "Name" string Required Source name of the kernel.
850 "SymbolName" string Required Name of the kernel
851 descriptor ELF symbol.
852 "Language" string Source language of the kernel.
860 "LanguageVersion" sequence of - The first integer is the major
862 - The second integer is the
864 "Attrs" mapping Mapping of kernel attributes.
866 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
867 for the mapping definition.
868 "Arguments" sequence of Sequence of mappings of the
869 mapping kernel arguments. See
870 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
871 for the definition of the mapping.
872 "CodeProps" mapping Mapping of properties related to
874 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
875 for the mapping definition.
876 "DebugProps" mapping Mapping of properties related to
877 the kernel debugging. See
878 :ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
879 for the mapping definition.
880 ================= ============== ========= ================================
884 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
885 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
887 =================== ============== ========= ==============================
888 String Key Value Type Required? Description
889 =================== ============== ========= ==============================
890 "ReqdWorkGroupSize" sequence of The dispatch work-group size
891 3 integers X, Y, Z must correspond to the
894 Corresponds to the OpenCL
895 ``reqd_work_group_size``
897 "WorkGroupSizeHint" sequence of The dispatch work-group size
898 3 integers X, Y, Z is likely to be the
901 Corresponds to the OpenCL
902 ``work_group_size_hint``
904 "VecTypeHint" string The name of a scalar or vector
907 Corresponds to the OpenCL
908 ``vec_type_hint`` attribute.
909 =================== ============== ========= ==============================
913 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
914 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
916 ================= ============== ========= ================================
917 String Key Value Type Required? Description
918 ================= ============== ========= ================================
919 "Name" string Kernel argument name.
920 "TypeName" string Kernel argument type name.
921 "Size" integer Required Kernel argument size in bytes.
922 "Align" integer Required Kernel argument alignment in
923 bytes. Must be a power of two.
924 "ValueKind" string Required Kernel argument kind that
925 specifies how to set up the
926 corresponding argument.
930 The argument is copied
931 directly into the kernarg.
934 A global address space pointer
935 to the buffer data is passed
938 "DynamicSharedPointer"
939 A group address space pointer
940 to dynamically allocated LDS
941 is passed in the kernarg.
944 A global address space
945 pointer to a S# is passed in
949 A global address space
950 pointer to a T# is passed in
954 A global address space pointer
955 to an OpenCL pipe is passed in
959 A global address space pointer
960 to an OpenCL device enqueue
961 queue is passed in the
964 "HiddenGlobalOffsetX"
965 The OpenCL grid dispatch
966 global offset for the X
967 dimension is passed in the
970 "HiddenGlobalOffsetY"
971 The OpenCL grid dispatch
972 global offset for the Y
973 dimension is passed in the
976 "HiddenGlobalOffsetZ"
977 The OpenCL grid dispatch
978 global offset for the Z
979 dimension is passed in the
983 An argument that is not used
984 by the kernel. Space needs to
985 be left for it, but it does
986 not need to be set up.
989 A global address space pointer
990 to the runtime printf buffer
991 is passed in kernarg.
994 A global address space pointer
995 to the OpenCL device enqueue
996 queue that should be used by
997 the kernel by default is
998 passed in the kernarg.
1000 "HiddenCompletionAction"
1006 "ValueType" string Required Kernel argument value type. Only
1007 present if "ValueKind" is
1008 "ByValue". For vector data
1009 types, the value is for the
1010 element type. Values include:
1026 How can it be determined if a
1027 vector type, and what size
1029 "PointeeAlign" integer Alignment in bytes of pointee
1030 type for pointer type kernel
1031 argument. Must be a power
1032 of 2. Only present if
1034 "DynamicSharedPointer".
1035 "AddrSpaceQual" string Kernel argument address space
1036 qualifier. Only present if
1037 "ValueKind" is "GlobalBuffer" or
1038 "DynamicSharedPointer". Values
1049 Is GlobalBuffer only Global
1051 DynamicSharedPointer always
1052 Local? Can HCC allow Generic?
1053 How can Private or Region
1055 "AccQual" string Kernel argument access
1056 qualifier. Only present if
1057 "ValueKind" is "Image" or
1068 "ActualAcc" string The actual memory accesses
1069 performed by the kernel on the
1070 kernel argument. Only present if
1071 "ValueKind" is "GlobalBuffer",
1072 "Image", or "Pipe". This may be
1073 more restrictive than indicated
1074 by "AccQual" to reflect what the
1075 kernel actual does. If not
1076 present then the runtime must
1077 assume what is implied by
1078 "AccQual" and "IsConst". Values
1085 "IsConst" boolean Indicates if the kernel argument
1086 is const qualified. Only present
1090 "IsRestrict" boolean Indicates if the kernel argument
1091 is restrict qualified. Only
1092 present if "ValueKind" is
1095 "IsVolatile" boolean Indicates if the kernel argument
1096 is volatile qualified. Only
1097 present if "ValueKind" is
1100 "IsPipe" boolean Indicates if the kernel argument
1101 is pipe qualified. Only present
1102 if "ValueKind" is "Pipe".
1105 Can GlobalBuffer be pipe
1107 ================= ============== ========= ================================
1111 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1112 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1114 ============================ ============== ========= =====================
1115 String Key Value Type Required? Description
1116 ============================ ============== ========= =====================
1117 "KernargSegmentSize" integer Required The size in bytes of
1119 that holds the values
1122 "GroupSegmentFixedSize" integer Required The amount of group
1126 bytes. This does not
1128 dynamically allocated
1129 group segment memory
1133 "PrivateSegmentFixedSize" integer Required The amount of fixed
1134 private address space
1135 memory required for a
1139 is 1 then additional
1141 to this value for the
1143 "KernargSegmentAlign" integer Required The maximum byte
1146 kernarg segment. Must
1148 "WavefrontSize" integer Required Wavefront size. Must
1150 "NumSGPRs" integer Number of scalar
1154 includes the special
1160 SGPR added if a trap
1166 "NumVGPRs" integer Number of vector
1170 "MaxFlatWorkgroupSize" integer Maximum flat
1173 kernel in work-items.
1174 "IsDynamicCallStack" boolean Indicates if the
1179 "IsXNACKEnabled" boolean Indicates if the
1183 ============================ ============== ========= =====================
1187 .. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
1188 :name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
1190 =================================== ============== ========= ==============
1191 String Key Value Type Required? Description
1192 =================================== ============== ========= ==============
1193 "DebuggerABIVersion" string
1194 "ReservedNumVGPRs" integer
1195 "ReservedFirstVGPR" integer
1196 "PrivateSegmentBufferSGPR" integer
1197 "WavefrontPrivateSegmentOffsetSGPR" integer
1198 =================================== ============== ========= ==============
1201 Plan to remove the debug properties metadata.
1206 The HSA architected queuing language (AQL) defines a user space memory interface
1207 that can be used to control the dispatch of kernels, in an agent independent
1208 way. An agent can have zero or more AQL queues created for it using the ROCm
1209 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1210 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1211 mechanics and packet layouts.
1213 The packet processor of a kernel agent is responsible for detecting and
1214 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1215 packet processor is implemented by the hardware command processor (CP),
1216 asynchronous dispatch controller (ADC) and shader processor input controller
1219 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1220 mode driver to initialize and register the AQL queue with CP.
1222 To dispatch a kernel the following actions are performed. This can occur in the
1223 CPU host program, or from an HSA kernel executing on a GPU.
1225 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1226 executed is obtained.
1227 2. A pointer to the kernel descriptor (see
1228 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1229 obtained. It must be for a kernel that is contained in a code object that that
1230 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1232 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1233 for a memory region with the kernarg property for the kernel agent that will
1234 execute the kernel. It must be at least 16 byte aligned.
1235 4. Kernel argument values are assigned to the kernel argument memory
1236 allocation. The layout is defined in the *HSA Programmer’s Language Reference*
1237 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1238 memory in the same way constant memory is accessed. (Note that the HSA
1239 specification allows an implementation to copy the kernel argument contents to
1240 another location that is accessed by the kernel.)
1241 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1242 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1243 packet. The packet must be set up, and the final write must use an atomic
1244 store release to set the packet kind to ensure the packet contents are
1245 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1246 notify the kernel agent that the AQL queue has been updated. These rules, and
1247 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1248 System Architecture Specification* [HSA]_.
1249 6. A kernel dispatch packet includes information about the actual dispatch,
1250 such as grid and work-group size, together with information from the code
1251 object about the kernel, such as segment sizes. The ROCm runtime queries on
1252 the kernel symbol can be used to obtain the code object values which are
1253 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
1254 7. CP executes micro-code and is responsible for detecting and setting up the
1255 GPU to execute the wavefronts of a kernel dispatch.
1256 8. CP ensures that when the a wavefront starts executing the kernel machine
1257 code, the scalar general purpose registers (SGPR) and vector general purpose
1258 registers (VGPR) are set up as required by the machine code. The required
1259 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1260 register state is defined in
1261 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1262 9. The prolog of the kernel machine code (see
1263 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1264 before continuing executing the machine code that corresponds to the kernel.
1265 10. When the kernel dispatch has completed execution, CP signals the completion
1266 signal specified in the kernel dispatch packet if not 0.
1268 .. _amdgpu-amdhsa-memory-spaces:
1273 The memory space properties are:
1275 .. table:: AMDHSA Memory Spaces
1276 :name: amdgpu-amdhsa-memory-spaces-table
1278 ================= =========== ======== ======= ==================
1279 Memory Space Name HSA Segment Hardware Address NULL Value
1281 ================= =========== ======== ======= ==================
1282 Private private scratch 32 0x00000000
1283 Local group LDS 32 0xFFFFFFFF
1284 Global global global 64 0x0000000000000000
1285 Constant constant *same as 64 0x0000000000000000
1287 Generic flat flat 64 0x0000000000000000
1288 Region N/A GDS 32 *not implemented
1290 ================= =========== ======== ======= ==================
1292 The global and constant memory spaces both use global virtual addresses, which
1293 are the same virtual address space used by the CPU. However, some virtual
1294 addresses may only be accessible to the CPU, some only accessible by the GPU,
1297 Using the constant memory space indicates that the data will not change during
1298 the execution of the kernel. This allows scalar read instructions to be
1299 used. The vector and scalar L1 caches are invalidated of volatile data before
1300 each kernel dispatch execution to allow constant memory to change values between
1303 The local memory space uses the hardware Local Data Store (LDS) which is
1304 automatically allocated when the hardware creates work-groups of wavefronts, and
1305 freed when all the wavefronts of a work-group have terminated. The data store
1306 (DS) instructions can be used to access it.
1308 The private memory space uses the hardware scratch memory support. If the kernel
1309 uses scratch, then the hardware allocates memory that is accessed using
1310 wavefront lane dword (4 byte) interleaving. The mapping used from private
1311 address to physical address is:
1313 ``wavefront-scratch-base +
1314 (private-address * wavefront-size * 4) +
1315 (wavefront-lane-id * 4)``
1317 There are different ways that the wavefront scratch base address is determined
1318 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1319 memory can be accessed in an interleaved manner using buffer instruction with
1320 the scratch buffer descriptor and per wave scratch offset, by the scratch
1321 instructions, or by flat instructions. If each lane of a wavefront accesses the
1322 same private address, the interleaving results in adjacent dwords being accessed
1323 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1324 supported except by flat and scratch instructions in GFX9.
1326 The generic address space uses the hardware flat address support available in
1327 GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1328 local appertures), that are outside the range of addressible global memory, to
1329 map from a flat address to a private or local address.
1331 FLAT instructions can take a flat address and access global, private (scratch)
1332 and group (LDS) memory depending in if the address is within one of the
1333 apperture ranges. Flat access to scratch requires hardware aperture setup and
1334 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1335 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1336 (see :ref:`amdgpu-amdhsa-m0`).
1338 To convert between a segment address and a flat address the base address of the
1339 appertures address can be used. For GFX7-GFX8 these are available in the
1340 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1341 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1342 GFX9 the appature base addresses are directly available as inline constant
1343 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1344 address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1345 which makes it easier to convert from flat to segment or segment to flat.
1350 Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1351 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1352 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1353 enumeration values for the queries that are not trivially deducible from the S#
1359 HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1360 structure allocated in memory accessible from both the CPU and GPU. The
1361 structure is defined by the ROCm runtime and subject to change between releases
1362 (see [AMD-ROCm-github]_).
1364 .. _amdgpu-amdhsa-hsa-aql-queue:
1369 The HSA AQL queue structure is defined by the ROCm runtime and subject to change
1370 between releases (see [AMD-ROCm-github]_). For some processors it contains
1371 fields needed to implement certain language features such as the flat address
1372 aperture bases. It also contains fields used by CP such as managing the
1373 allocation of scratch memory.
1375 .. _amdgpu-amdhsa-kernel-descriptor:
1380 A kernel descriptor consists of the information needed by CP to initiate the
1381 execution of a kernel, including the entry point address of the machine code
1382 that implements the kernel.
1384 Kernel Descriptor for GFX6-GFX9
1385 +++++++++++++++++++++++++++++++
1387 CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1389 .. table:: Kernel Descriptor for GFX6-GFX9
1390 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1392 ======= ======= =============================== ===========================
1393 Bits Size Field Name Description
1394 ======= ======= =============================== ===========================
1395 31:0 4 bytes group_segment_fixed_size The amount of fixed local
1396 address space memory
1397 required for a work-group
1398 in bytes. This does not
1399 include any dynamically
1400 allocated local address
1401 space memory that may be
1402 added when the kernel is
1404 63:32 4 bytes private_segment_fixed_size The amount of fixed
1405 private address space
1406 memory required for a
1407 work-item in bytes. If
1408 is_dynamic_callstack is 1
1409 then additional space must
1410 be added to this value for
1412 95:64 4 bytes max_flat_workgroup_size Maximum flat work-group
1413 size supported by the
1414 kernel in work-items.
1415 96 1 bit is_dynamic_call_stack Indicates if the generated
1416 machine code is using a
1417 dynamically sized call
1419 97 1 bit is_xnack_enabled Indicates if the generated
1420 machine code is capable of
1422 127:98 30 bits Reserved. Must be 0.
1423 191:128 8 bytes kernel_code_entry_byte_offset Byte offset (possibly
1426 descriptor to kernel's
1427 entry point instruction
1428 which must be 256 byte
1430 383:192 24 Reserved. Must be 0.
1432 415:384 4 bytes compute_pgm_rsrc1 Compute Shader (CS)
1433 program settings used by
1435 ``COMPUTE_PGM_RSRC1``
1438 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
1439 447:416 4 bytes compute_pgm_rsrc2 Compute Shader (CS)
1440 program settings used by
1442 ``COMPUTE_PGM_RSRC2``
1445 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1446 448 1 bit enable_sgpr_private_segment Enable the setup of the
1447 _buffer SGPR user data registers
1449 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1451 The total number of SGPR
1453 requested must not exceed
1454 16 and match value in
1455 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1456 Any requests beyond 16
1458 449 1 bit enable_sgpr_dispatch_ptr *see above*
1459 450 1 bit enable_sgpr_queue_ptr *see above*
1460 451 1 bit enable_sgpr_kernarg_segment_ptr *see above*
1461 452 1 bit enable_sgpr_dispatch_id *see above*
1462 453 1 bit enable_sgpr_flat_scratch_init *see above*
1463 454 1 bit enable_sgpr_private_segment *see above*
1465 455 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
1466 _count_X should always be 0.
1467 456 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
1468 _count_Y should always be 0.
1469 457 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
1470 _count_Z should always be 0.
1471 463:458 6 bits Reserved. Must be 0.
1472 511:464 4 Reserved. Must be 0.
1474 512 **Total size 64 bytes.**
1475 ======= ===================================================================
1479 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1480 :name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
1482 ======= ======= =============================== ===========================================================================
1483 Bits Size Field Name Description
1484 ======= ======= =============================== ===========================================================================
1485 5:0 6 bits granulated_workitem_vgpr_count Number of vector registers
1486 used by each work-item,
1487 granularity is device
1491 roundup((max-vgpg + 1)
1494 Used by CP to set up
1495 ``COMPUTE_PGM_RSRC1.VGPRS``.
1496 9:6 4 bits granulated_wavefront_sgpr_count Number of scalar registers
1497 used by a wavefront,
1498 granularity is device
1502 roundup((max-sgpg + 1)
1505 roundup((max-sgpg + 1)
1508 Includes the special SGPRs
1509 for VCC, Flat Scratch (for
1510 GFX7 onwards) and XNACK
1511 (for GFX8 onwards). It does
1512 not include the 16 SGPR
1513 added if a trap handler is
1516 Used by CP to set up
1517 ``COMPUTE_PGM_RSRC1.SGPRS``.
1518 11:10 2 bits priority Must be 0.
1520 Start executing wavefront
1521 at the specified priority.
1523 CP is responsible for
1525 ``COMPUTE_PGM_RSRC1.PRIORITY``.
1526 13:12 2 bits float_mode_round_32 Wavefront starts execution
1527 with specified rounding
1530 precision floating point
1533 Floating point rounding
1534 mode values are defined in
1535 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1537 Used by CP to set up
1538 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1539 15:14 2 bits float_mode_round_16_64 Wavefront starts execution
1540 with specified rounding
1541 denorm mode for half/double (16
1542 and 64 bit) floating point
1543 precision floating point
1546 Floating point rounding
1547 mode values are defined in
1548 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1550 Used by CP to set up
1551 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1552 17:16 2 bits float_mode_denorm_32 Wavefront starts execution
1553 with specified denorm mode
1556 precision floating point
1559 Floating point denorm mode
1560 values are defined in
1561 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1563 Used by CP to set up
1564 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1565 19:18 2 bits float_mode_denorm_16_64 Wavefront starts execution
1566 with specified denorm mode
1568 and 64 bit) floating point
1569 precision floating point
1572 Floating point denorm mode
1573 values are defined in
1574 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1576 Used by CP to set up
1577 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1578 20 1 bit priv Must be 0.
1580 Start executing wavefront
1581 in privilege trap handler
1584 CP is responsible for
1586 ``COMPUTE_PGM_RSRC1.PRIV``.
1587 21 1 bit enable_dx10_clamp Wavefront starts execution
1588 with DX10 clamp mode
1589 enabled. Used by the vector
1590 ALU to force DX-10 style
1591 treatment of NaN's (when
1592 set, clamp NaN to zero,
1596 Used by CP to set up
1597 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1598 22 1 bit debug_mode Must be 0.
1600 Start executing wavefront
1601 in single step mode.
1603 CP is responsible for
1605 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1606 23 1 bit enable_ieee_mode Wavefront starts execution
1608 enabled. Floating point
1609 opcodes that support
1610 exception flag gathering
1611 will quiet and propagate
1612 signaling-NaN inputs per
1613 IEEE 754-2008. Min_dx10 and
1614 max_dx10 become IEEE
1615 754-2008 compliant due to
1616 signaling-NaN propagation
1619 Used by CP to set up
1620 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1621 24 1 bit bulky Must be 0.
1623 Only one work-group allowed
1624 to execute on a compute
1627 CP is responsible for
1629 ``COMPUTE_PGM_RSRC1.BULKY``.
1630 25 1 bit cdbg_user Must be 0.
1632 Flag that can be used to
1633 control debugging code.
1635 CP is responsible for
1637 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1638 31:26 6 bits Reserved. Must be 0.
1639 32 **Total size 4 bytes**
1640 ======= ===================================================================================================================
1644 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1645 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1647 ======= ======= =============================== ===========================================================================
1648 Bits Size Field Name Description
1649 ======= ======= =============================== ===========================================================================
1650 0 1 bit enable_sgpr_private_segment Enable the setup of the
1651 _wave_offset SGPR wave scratch offset
1652 system register (see
1653 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1655 Used by CP to set up
1656 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1657 5:1 5 bits user_sgpr_count The total number of SGPR
1659 requested. This number must
1660 match the number of user
1661 data registers enabled.
1663 Used by CP to set up
1664 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1665 6 1 bit enable_trap_handler Set to 1 if code contains a
1666 TRAP instruction which
1667 requires a trap handler to
1671 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1673 installed a trap handler
1674 regardless of the setting
1676 7 1 bit enable_sgpr_workgroup_id_x Enable the setup of the
1677 system SGPR register for
1678 the work-group id in the X
1680 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1682 Used by CP to set up
1683 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1684 8 1 bit enable_sgpr_workgroup_id_y Enable the setup of the
1685 system SGPR register for
1686 the work-group id in the Y
1688 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1690 Used by CP to set up
1691 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1692 9 1 bit enable_sgpr_workgroup_id_z Enable the setup of the
1693 system SGPR register for
1694 the work-group id in the Z
1696 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1698 Used by CP to set up
1699 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1700 10 1 bit enable_sgpr_workgroup_info Enable the setup of the
1701 system SGPR register for
1702 work-group information (see
1703 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1705 Used by CP to set up
1706 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1707 12:11 2 bits enable_vgpr_workitem_id Enable the setup of the
1708 VGPR system registers used
1709 for the work-item ID.
1710 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1713 Used by CP to set up
1714 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1715 13 1 bit enable_exception_address_watch Must be 0.
1717 Wavefront starts execution
1719 exceptions enabled which
1720 are generated when L1 has
1721 witnessed a thread access
1725 CP is responsible for
1726 filling in the address
1728 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1729 according to what the
1731 14 1 bit enable_exception_memory Must be 0.
1733 Wavefront starts execution
1734 with memory violation
1735 exceptions exceptions
1736 enabled which are generated
1737 when a memory violation has
1738 occurred for this wave from
1740 (write-to-read-only-memory,
1741 mis-aligned atomic, LDS
1742 address out of range,
1743 illegal address, etc.).
1747 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1748 according to what the
1750 23:15 9 bits granulated_lds_size Must be 0.
1752 CP uses the rounded value
1753 from the dispatch packet,
1754 not this value, as the
1755 dispatch may contain
1756 dynamically allocated group
1757 segment memory. CP writes
1759 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1761 Amount of group segment
1762 (LDS) to allocate for each
1763 work-group. Granularity is
1767 roundup(lds-size / (64 * 4))
1769 roundup(lds-size / (128 * 4))
1771 24 1 bit enable_exception_ieee_754_fp Wavefront starts execution
1772 _invalid_operation with specified exceptions
1775 Used by CP to set up
1776 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1777 (set from bits 0..6).
1781 25 1 bit enable_exception_fp_denormal FP Denormal one or more
1782 _source input operands is a
1784 26 1 bit enable_exception_ieee_754_fp IEEE 754 FP Division by
1785 _division_by_zero Zero
1786 27 1 bit enable_exception_ieee_754_fp IEEE 754 FP FP Overflow
1788 28 1 bit enable_exception_ieee_754_fp IEEE 754 FP Underflow
1790 29 1 bit enable_exception_ieee_754_fp IEEE 754 FP Inexact
1792 30 1 bit enable_exception_int_divide_by Integer Division by Zero
1793 _zero (rcp_iflag_f32 instruction
1795 31 1 bit Reserved. Must be 0.
1796 32 **Total size 4 bytes.**
1797 ======= ===================================================================================================================
1801 .. table:: Floating Point Rounding Mode Enumeration Values
1802 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1804 ===================================== ===== ===============================
1805 Enumeration Name Value Description
1806 ===================================== ===== ===============================
1807 AMD_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1808 AMD_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1809 AMD_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1810 AMD_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1811 ===================================== ===== ===============================
1815 .. table:: Floating Point Denorm Mode Enumeration Values
1816 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1818 ===================================== ===== ===============================
1819 Enumeration Name Value Description
1820 ===================================== ===== ===============================
1821 AMD_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1823 AMD_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1824 AMD_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1825 AMD_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1826 ===================================== ===== ===============================
1830 .. table:: System VGPR Work-Item ID Enumeration Values
1831 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1833 ===================================== ===== ===============================
1834 Enumeration Name Value Description
1835 ===================================== ===== ===============================
1836 AMD_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension ID.
1837 AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1839 AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1841 AMD_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1842 ===================================== ===== ===============================
1844 .. _amdgpu-amdhsa-initial-kernel-execution-state:
1846 Initial Kernel Execution State
1847 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1849 This section defines the register state that will be set up by the packet
1850 processor prior to the start of execution of every wavefront. This is limited by
1851 the constraints of the hardware controllers of CP/ADC/SPI.
1853 The order of the SGPR registers is defined, but the compiler can specify which
1854 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1855 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1856 for enabled registers are dense starting at SGPR0: the first enabled register is
1857 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1860 The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1861 all waves of the grid. It is possible to specify more than 16 User SGPRs using
1862 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1863 initialized. These are then immediately followed by the System SGPRs that are
1864 set up by ADC/SPI and can have different values for each wave of the grid
1867 SGPR register initial state is defined in
1868 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1870 .. table:: SGPR Register Set Up Order
1871 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1873 ========== ========================== ====== ==============================
1874 SGPR Order Name Number Description
1875 (kernel descriptor enable of
1877 ========== ========================== ====== ==============================
1878 First Private Segment Buffer 4 V# that can be used, together
1879 (enable_sgpr_private with Scratch Wave Offset as an
1880 _segment_buffer) offset, to access the private
1881 memory space using a segment
1884 CP uses the value provided by
1886 then Dispatch Ptr 2 64 bit address of AQL dispatch
1887 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
1889 then Queue Ptr 2 64 bit address of amd_queue_t
1890 (enable_sgpr_queue_ptr) object for AQL queue on which
1891 the dispatch packet was
1893 then Kernarg Segment Ptr 2 64 bit address of Kernarg
1894 (enable_sgpr_kernarg segment. This is directly
1895 _segment_ptr) copied from the
1896 kernarg_address in the kernel
1899 Having CP load it once avoids
1900 loading it at the beginning of
1902 then Dispatch Id 2 64 bit Dispatch ID of the
1903 (enable_sgpr_dispatch_id) dispatch packet being
1905 then Flat Scratch Init 2 This is 2 SGPRs:
1906 (enable_sgpr_flat_scratch
1910 The first SGPR is a 32 bit
1912 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1913 to per SPI base of memory
1914 for scratch for the queue
1915 executing the kernel
1916 dispatch. CP obtains this
1917 from the runtime. (The
1918 Scratch Segment Buffer base
1920 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1921 plus this offset.) The value
1922 of Scratch Wave Offset must
1923 be added to this offset by
1924 the kernel machine code,
1925 right shifted by 8, and
1926 moved to the FLAT_SCRATCH_HI
1928 FLAT_SCRATCH_HI corresponds
1929 to SGPRn-4 on GFX7, and
1930 SGPRn-6 on GFX8 (where SGPRn
1931 is the highest numbered SGPR
1932 allocated to the wave).
1934 multiplied by 256 (as it is
1935 in units of 256 bytes) and
1937 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
1938 to calculate the per wave
1939 FLAT SCRATCH BASE in flat
1940 memory instructions that
1944 The second SGPR is 32 bit
1945 byte size of a single
1946 work-item’s scratch memory
1947 usage. CP obtains this from
1948 the runtime, and it is
1949 always a multiple of DWORD.
1950 CP checks that the value in
1951 the kernel dispatch packet
1952 Private Segment Byte Size is
1953 not larger, and requests the
1954 runtime to increase the
1955 queue's scratch size if
1956 necessary. The kernel code
1958 FLAT_SCRATCH_LO which is
1959 SGPRn-3 on GFX7 and SGPRn-5
1960 on GFX8. FLAT_SCRATCH_LO is
1961 used as the FLAT SCRATCH
1963 instructions. Having CP load
1964 it once avoids loading it at
1965 the beginning of every
1966 wavefront. GFX9 This is the
1967 64 bit base address of the
1968 per SPI scratch backing
1969 memory managed by SPI for
1970 the queue executing the
1971 kernel dispatch. CP obtains
1972 this from the runtime (and
1973 divides it if there are
1974 multiple Shader Arrays each
1975 with its own SPI). The value
1976 of Scratch Wave Offset must
1977 be added by the kernel
1978 machine code and the result
1979 moved to the FLAT_SCRATCH
1980 SGPR which is SGPRn-6 and
1981 SGPRn-5. It is used as the
1982 FLAT SCRATCH BASE in flat
1983 memory instructions. then
1984 Private Segment Size 1 The
1985 32 bit byte size of a
1986 (enable_sgpr_private single
1988 scratch_segment_size) memory
1989 allocation. This is the
1990 value from the kernel
1991 dispatch packet Private
1992 Segment Byte Size rounded up
1993 by CP to a multiple of
1996 Having CP load it once avoids
1997 loading it at the beginning of
2000 This is not used for
2001 GFX7-GFX8 since it is the same
2002 value as the second SGPR of
2003 Flat Scratch Init. However, it
2004 may be needed for GFX9 which
2005 changes the meaning of the
2006 Flat Scratch Init value.
2007 then Grid Work-Group Count X 1 32 bit count of the number of
2008 (enable_sgpr_grid work-groups in the X dimension
2009 _workgroup_count_X) for the grid being
2010 executed. Computed from the
2011 fields in the kernel dispatch
2012 packet as ((grid_size.x +
2013 workgroup_size.x - 1) /
2015 then Grid Work-Group Count Y 1 32 bit count of the number of
2016 (enable_sgpr_grid work-groups in the Y dimension
2017 _workgroup_count_Y && for the grid being
2018 less than 16 previous executed. Computed from the
2019 SGPRs) fields in the kernel dispatch
2020 packet as ((grid_size.y +
2021 workgroup_size.y - 1) /
2024 Only initialized if <16
2025 previous SGPRs initialized.
2026 then Grid Work-Group Count Z 1 32 bit count of the number of
2027 (enable_sgpr_grid work-groups in the Z dimension
2028 _workgroup_count_Z && for the grid being
2029 less than 16 previous executed. Computed from the
2030 SGPRs) fields in the kernel dispatch
2031 packet as ((grid_size.z +
2032 workgroup_size.z - 1) /
2035 Only initialized if <16
2036 previous SGPRs initialized.
2037 then Work-Group Id X 1 32 bit work-group id in X
2038 (enable_sgpr_workgroup_id dimension of grid for
2040 then Work-Group Id Y 1 32 bit work-group id in Y
2041 (enable_sgpr_workgroup_id dimension of grid for
2043 then Work-Group Id Z 1 32 bit work-group id in Z
2044 (enable_sgpr_workgroup_id dimension of grid for
2046 then Work-Group Info 1 {first_wave, 14’b0000,
2047 (enable_sgpr_workgroup ordered_append_term[10:0],
2048 _info) threadgroup_size_in_waves[5:0]}
2049 then Scratch Wave Offset 1 32 bit byte offset from base
2050 (enable_sgpr_private of scratch base of queue
2051 _segment_wave_offset) executing the kernel
2052 dispatch. Must be used as an
2054 segment address when using
2055 Scratch Segment Buffer. It
2056 must be used to set up FLAT
2057 SCRATCH for flat addressing
2059 :ref:`amdgpu-amdhsa-flat-scratch`).
2060 ========== ========================== ====== ==============================
2062 The order of the VGPR registers is defined, but the compiler can specify which
2063 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2064 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2065 for enabled registers are dense starting at VGPR0: the first enabled register is
2066 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2069 VGPR register initial state is defined in
2070 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2072 .. table:: VGPR Register Set Up Order
2073 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2075 ========== ========================== ====== ==============================
2076 VGPR Order Name Number Description
2077 (kernel descriptor enable of
2079 ========== ========================== ====== ==============================
2080 First Work-Item Id X 1 32 bit work item id in X
2081 (Always initialized) dimension of work-group for
2083 then Work-Item Id Y 1 32 bit work item id in Y
2084 (enable_vgpr_workitem_id dimension of work-group for
2085 > 0) wavefront lane.
2086 then Work-Item Id Z 1 32 bit work item id in Z
2087 (enable_vgpr_workitem_id dimension of work-group for
2088 > 1) wavefront lane.
2089 ========== ========================== ====== ==============================
2091 The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2093 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2095 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2096 combination including none.
2097 3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2098 cannot included with the flat scratch init value which is per queue.
2099 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2102 Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2103 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2105 The global segment can be accessed either using buffer instructions (GFX6 which
2106 has V# 64 bit address support), flat instructions (GFX7-9), or global
2107 instructions (GFX9).
2109 If buffer operations are used then the compiler can generate a V# with the
2110 following properties:
2114 * ATC: 1 if IOMMU present (such as APU)
2116 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2117 APU and NC for dGPU).
2119 .. _amdgpu-amdhsa-kernel-prolog:
2124 .. _amdgpu-amdhsa-m0:
2130 The M0 register must be initialized with a value at least the total LDS size
2131 if the kernel may access LDS via DS or flat operations. Total LDS size is
2132 available in dispatch packet. For M0, it is also possible to use maximum
2133 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2136 The M0 register is not used for range checking LDS accesses and so does not
2137 need to be initialized in the prolog.
2139 .. _amdgpu-amdhsa-flat-scratch:
2144 If the kernel may use flat operations to access scratch memory, the prolog code
2145 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2146 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2147 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2150 Flat scratch is not supported.
2153 1. The low word of Flat Scratch Init is 32 bit byte offset from
2154 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2155 being managed by SPI for the queue executing the kernel dispatch. This is
2156 the same value used in the Scratch Segment Buffer V# base address. The
2157 prolog must add the value of Scratch Wave Offset to get the wave's byte
2158 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2159 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2160 by 8 before moving into FLAT_SCRATCH_LO.
2161 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2162 work-items scratch memory usage. This is directly loaded from the kernel
2163 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2164 DWORD. Having CP load it once avoids loading it at the beginning of every
2165 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2168 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2169 memory being managed by SPI for the queue executing the kernel dispatch. The
2170 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2171 pair for use as the flat scratch base in flat memory instructions.
2173 .. _amdgpu-amdhsa-memory-model:
2178 This section describes the mapping of LLVM memory model onto AMDGPU machine code
2179 (see :ref:`memmodel`). *The implementation is WIP.*
2182 Update when implementation complete.
2184 Support more relaxed OpenCL memory model to be controlled by environment
2185 component of target triple.
2187 The AMDGPU backend supports the memory synchronization scopes specified in
2188 :ref:`amdgpu-memory-scopes`.
2190 The code sequences used to implement the memory model are defined in table
2191 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2193 The sequences specify the order of instructions that a single thread must
2194 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2195 to other memory instructions executed by the same thread. This allows them to be
2196 moved earlier or later which can allow them to be combined with other instances
2197 of the same instruction, or hoisted/sunk out of loops to improve
2198 performance. Only the instructions related to the memory model are given;
2199 additional ``s_waitcnt`` instructions are required to ensure registers are
2200 defined before being used. These may be able to be combined with the memory
2201 model ``s_waitcnt`` instructions as described above.
2203 The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
2204 OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
2205 relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
2206 memory model which has separate happens-before relations for the global and
2207 local address spaces, and only a fence specifying both global and local address
2208 space joins the relationships. Since the LLVM ``memfence`` instruction does not
2209 allow an address space to be specified the OpenCL fence has to convervatively
2210 assume both local and global address space was specified. However, optimizations
2211 can often be done to eliminate the additional ``s_waitcnt``instructions when
2212 there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
2213 instructions. The code sequences in the table indicate what can be omitted for
2214 the OpenCL memory. The target triple environment is used to determine if the
2215 source language is OpenCL (see :ref:`amdgpu-opencl`).
2217 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2220 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2221 termed vector memory operations.
2225 * Each agent has multiple compute units (CU).
2226 * Each CU has multiple SIMDs that execute wavefronts.
2227 * The wavefronts for a single work-group are executed in the same CU but may be
2228 executed by different SIMDs.
2229 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2231 * All LDS operations of a CU are performed as wavefront wide operations in a
2232 global order and involve no caching. Completion is reported to a wavefront in
2234 * The LDS memory has multiple request queues shared by the SIMDs of a
2235 CU. Therefore, the LDS operations performed by different waves of a work-group
2236 can be reordered relative to each other, which can result in reordering the
2237 visibility of vector memory operations with respect to LDS operations of other
2238 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2239 ensure synchronization between LDS operations and vector memory operations
2240 between waves of a work-group, but not between operations performed by the
2242 * The vector memory operations are performed as wavefront wide operations and
2243 completion is reported to a wavefront in execution order. The exception is
2244 that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
2245 vector memory order if they access LDS memory, and out of LDS operation order
2246 if they access global memory.
2247 * The vector memory operations access a vector L1 cache shared by all wavefronts
2248 on a CU. Therefore, no special action is required for coherence between
2249 wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
2250 coherence between waves executing in different work-groups as they may be
2251 executing on different CUs.
2252 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2253 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2254 scalar operations are used in a restricted way so do not impact the memory
2255 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2256 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2258 * The L2 cache has independent channels to service disjoint ranges of virtual
2260 * Each CU has a separate request queue per channel. Therefore, the vector and
2261 scalar memory operations performed by waves executing in different work-groups
2262 (which may be executing on different CUs) of an agent can be reordered
2263 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2264 synchronization between vector memory operations of different CUs. It ensures a
2265 previous vector memory operation has completed before executing a subsequent
2266 vector memory or LDS operation and so can be used to meet the requirements of
2267 acquire and release.
2268 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2269 of virtual addresses can be set up to bypass it to ensure system coherence.
2271 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
2272 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2273 memory, atomic memory orderings are not meaningful and all accesses are treated
2276 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2277 scalar memory instructions). Since the constant address space contents do not
2278 change during the execution of a kernel dispatch it is not legal to perform
2279 stores, and atomic memory orderings are not meaningful and all access are
2280 treated as non-atomic.
2282 A memory synchronization scope wider than work-group is not meaningful for the
2283 group (LDS) address space and is treated as work-group.
2285 The memory model does not support the region address space which is treated as
2288 Acquire memory ordering is not meaningful on store atomic instructions and is
2289 treated as non-atomic.
2291 Release memory ordering is not meaningful on load atomic instructions and is
2292 treated a non-atomic.
2294 Acquire-release memory ordering is not meaningful on load or store atomic
2295 instructions and is treated as acquire and release respectively.
2297 AMDGPU backend only uses scalar memory operations to access memory that is
2298 proven to not change during the execution of the kernel dispatch. This includes
2299 constant address space and global address space for program scope const
2300 variables. Therefore the kernel machine code does not have to maintain the
2301 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2302 and vector L1 caches are invalidated between kernel dispatches by CP since
2303 constant address space data may change between kernel dispatch executions. See
2304 :ref:`amdgpu-amdhsa-memory-spaces`.
2306 The one execption is if scalar writes are used to spill SGPR registers. In this
2307 case the AMDGPU backend ensures the memory location used to spill is never
2308 accessed by vector memory operations at the same time. If scalar writes are used
2309 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2310 return since the locations may be used for vector memory instructions by a
2311 future wave that uses the same scratch area, or a function call that creates a
2312 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2313 as all scalar writes are write-before-read in the same thread.
2315 Scratch backing memory (which is used for the private address space) is accessed
2316 with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
2317 is only accessed by a single thread, and is always write-before-read,
2318 there is never a need to invalidate these entries from the L1 cache. Hence all
2319 cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
2322 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2323 to invalidate the L2 cache. This also causes it to be treated as non-volatile
2324 and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
2325 coherent) and so the L2 cache will coherent with the CPU and other agents.
2327 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2328 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2330 ============ ============ ============== ========== =======================
2331 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2332 Ordering Sync Scope Address
2334 ============ ============ ============== ========== =======================
2336 ---------------------------------------------------------------------------
2337 load *none* *none* - global non-volatile
2338 - generic 1. buffer/global/flat_load
2340 1. buffer/global/flat_load
2342 load *none* *none* - local 1. ds_load
2343 store *none* *none* - global 1. buffer/global/flat_store
2345 store *none* *none* - local 1. ds_store
2346 **Unordered Atomic**
2347 ---------------------------------------------------------------------------
2348 load atomic unordered *any* *any* *Same as non-atomic*.
2349 store atomic unordered *any* *any* *Same as non-atomic*.
2350 atomicrmw unordered *any* *any* *Same as monotonic
2352 **Monotonic Atomic**
2353 ---------------------------------------------------------------------------
2354 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2355 - wavefront - generic
2357 load atomic monotonic - singlethread - local 1. ds_load
2360 load atomic monotonic - agent - global 1. buffer/global/flat_load
2361 - system - generic glc=1
2362 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2363 - wavefront - generic
2367 store atomic monotonic - singlethread - local 1. ds_store
2370 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2371 - wavefront - generic
2375 atomicrmw monotonic - singlethread - local 1. ds_atomic
2379 ---------------------------------------------------------------------------
2380 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2383 load atomic acquire - workgroup - global 1. buffer/global_load
2384 load atomic acquire - workgroup - local 1. ds/flat_load
2385 - generic 2. s_waitcnt lgkmcnt(0)
2389 - Must happen before
2402 load atomic acquire - agent - global 1. buffer/global_load
2404 2. s_waitcnt vmcnt(0)
2406 - Must happen before
2414 3. buffer_wbinvl1_vol
2416 - Must happen before
2426 load atomic acquire - agent - generic 1. flat_load glc=1
2427 - system 2. s_waitcnt vmcnt(0) &
2432 - Must happen before
2435 - Ensures the flat_load
2440 3. buffer_wbinvl1_vol
2442 - Must happen before
2452 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2455 atomicrmw acquire - workgroup - global 1. buffer/global_atomic
2456 atomicrmw acquire - workgroup - local 1. ds/flat_atomic
2457 - generic 2. waitcnt lgkmcnt(0)
2461 - Must happen before
2474 atomicrmw acquire - agent - global 1. buffer/global_atomic
2475 - system 2. s_waitcnt vmcnt(0)
2477 - Must happen before
2486 3. buffer_wbinvl1_vol
2488 - Must happen before
2498 atomicrmw acquire - agent - generic 1. flat_atomic
2499 - system 2. s_waitcnt vmcnt(0) &
2504 - Must happen before
2513 3. buffer_wbinvl1_vol
2515 - Must happen before
2525 fence acquire - singlethread *none* *none*
2527 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2558 fence-paired-atomic).
2559 - Must happen before
2570 fence-paired-atomic.
2572 fence acquire - agent *none* 1. s_waitcnt vmcnt(0) &
2587 - Could be split into
2596 - s_waitcnt vmcnt(0)
2607 fence-paired-atomic).
2608 - s_waitcnt lgkmcnt(0)
2619 fence-paired-atomic).
2620 - Must happen before
2634 fence-paired-atomic.
2636 2. buffer_wbinvl1_vol
2638 - Must happen before
2639 any following global/generic
2649 ---------------------------------------------------------------------------
2650 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2653 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2663 - Must happen before
2674 2. buffer/global/flat_store
2675 store atomic release - workgroup - local 1. ds_store
2676 store atomic release - agent - global 1. s_waitcnt vmcnt(0) &
2677 - system - generic lgkmcnt(0)
2681 - Could be split into
2690 - s_waitcnt vmcnt(0)
2697 - s_waitcnt lgkmcnt(0)
2704 - Must happen before
2715 2. buffer/global/ds/flat_store
2716 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2719 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2729 - Must happen before
2740 2. buffer/global/flat_atomic
2741 atomicrmw release - workgroup - local 1. ds_atomic
2742 atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) &
2743 - system - generic lgkmcnt(0)
2747 - Could be split into
2756 - s_waitcnt vmcnt(0)
2763 - s_waitcnt lgkmcnt(0)
2770 - Must happen before
2781 2. buffer/global/ds/flat_atomic*
2782 fence release - singlethread *none* *none*
2784 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2804 - Must happen before
2813 fence-paired-atomic).
2820 fence-paired-atomic.
2822 fence release - agent *none* 1. s_waitcnt vmcnt(0) &
2837 - Could be split into
2846 - s_waitcnt vmcnt(0)
2853 - s_waitcnt lgkmcnt(0)
2860 - Must happen before
2869 fence-paired-atomic).
2876 fence-paired-atomic.
2878 **Acquire-Release Atomic**
2879 ---------------------------------------------------------------------------
2880 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
2883 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
2893 - Must happen before
2904 2. buffer/global_atomic
2905 atomicrmw acq_rel - workgroup - local 1. ds_atomic
2906 2. s_waitcnt lgkmcnt(0)
2910 - Must happen before
2923 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2933 - Must happen before
2945 3. s_waitcnt lgkmcnt(0)
2949 - Must happen before
2961 atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) &
2966 - Could be split into
2975 - s_waitcnt vmcnt(0)
2982 - s_waitcnt lgkmcnt(0)
2989 - Must happen before
3000 2. buffer/global_atomic
3001 3. s_waitcnt vmcnt(0)
3003 - Must happen before
3012 4. buffer_wbinvl1_vol
3014 - Must happen before
3024 atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) &
3029 - Could be split into
3038 - s_waitcnt vmcnt(0)
3045 - s_waitcnt lgkmcnt(0)
3052 - Must happen before
3064 3. s_waitcnt vmcnt(0) &
3069 - Must happen before
3078 4. buffer_wbinvl1_vol
3080 - Must happen before
3090 fence acq_rel - singlethread *none* *none*
3092 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3112 - Must happen before
3135 fence-paired-atomic)
3156 fence-paired-atomic).
3161 fence acq_rel - agent *none* 1. s_waitcnt vmcnt(0) &
3176 - Could be split into
3185 - s_waitcnt vmcnt(0)
3192 - s_waitcnt lgkmcnt(0)
3199 - Must happen before
3204 global/local/generic
3213 fence-paired-atomic)
3225 global/local/generic
3234 fence-paired-atomic).
3239 2. buffer_wbinvl1_vol
3241 - Must happen before
3255 **Sequential Consistent Atomic**
3256 ---------------------------------------------------------------------------
3257 load atomic seq_cst - singlethread - global *Same as corresponding
3258 - wavefront - local load atomic acquire*.
3259 - workgroup - generic
3260 load atomic seq_cst - agent - global 1. s_waitcnt vmcnt(0)
3262 - generic - Must happen after
3309 instructions same as
3313 store atomic seq_cst - singlethread - global *Same as corresponding
3314 - wavefront - local store atomic release*.
3315 - workgroup - generic
3316 store atomic seq_cst - agent - global *Same as corresponding
3317 - system - generic store atomic release*.
3318 atomicrmw seq_cst - singlethread - global *Same as corresponding
3319 - wavefront - local atomicrmw acq_rel*.
3320 - workgroup - generic
3321 atomicrmw seq_cst - agent - global *Same as corresponding
3322 - system - generic atomicrmw acq_rel*.
3323 fence seq_cst - singlethread *none* *Same as corresponding
3324 - wavefront fence acq_rel*.
3328 ============ ============ ============== ========== =======================
3330 The memory order also adds the single thread optimization constrains defined in
3332 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3334 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3335 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3337 ============ ==============================================================
3338 LLVM Memory Optimization Constraints
3340 ============ ==============================================================
3343 acquire - If a load atomic/atomicrmw then no following load/load
3344 atomic/store/ store atomic/atomicrmw/fence instruction can
3345 be moved before the acquire.
3346 - If a fence then same as load atomic, plus no preceding
3347 associated fence-paired-atomic can be moved after the fence.
3348 release - If a store atomic/atomicrmw then no preceding load/load
3349 atomic/store/ store atomic/atomicrmw/fence instruction can
3350 be moved after the release.
3351 - If a fence then same as store atomic, plus no following
3352 associated fence-paired-atomic can be moved before the
3354 acq_rel Same constraints as both acquire and release.
3355 seq_cst - If a load atomic then same constraints as acquire, plus no
3356 preceding sequentially consistent load atomic/store
3357 atomic/atomicrmw/fence instruction can be moved after the
3359 - If a store atomic then the same constraints as release, plus
3360 no following sequentially consistent load atomic/store
3361 atomic/atomicrmw/fence instruction can be moved before the
3363 - If an atomicrmw/fence then same constraints as acq_rel.
3364 ============ ==============================================================
3369 For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3370 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3371 the ``s_trap`` instruction with the following usage:
3373 .. table:: AMDGPU Trap Handler for AMDHSA OS
3374 :name: amdgpu-trap-handler-for-amdhsa-os-table
3376 =================== =============== =============== =======================
3377 Usage Code Sequence Trap Handler Description
3379 =================== =============== =============== =======================
3380 reserved ``s_trap 0x00`` Reserved by hardware.
3381 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3382 ``queue_ptr`` ``debugtrap``
3383 ``VGPR0``: intrinsic (not
3384 ``arg`` implemented).
3385 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3386 ``queue_ptr`` terminated and its
3387 associated queue put
3388 into the error state.
3389 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3390 ``queue_ptr`` installed handled
3391 same as ``llvm.trap``.
3392 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3394 debugger ``s_trap 0x08`` Reserved for debugger.
3395 debugger ``s_trap 0xfe`` Reserved for debugger.
3396 debugger ``s_trap 0xff`` Reserved for debugger.
3397 =================== =============== =============== =======================
3402 This section provides code conventions used when the target triple OS is
3403 empty (see :ref:`amdgpu-target-triples`).
3408 For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3409 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3410 instructions are handled as follows:
3412 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3413 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3415 =============== =============== ===========================================
3416 Usage Code Sequence Description
3417 =============== =============== ===========================================
3418 llvm.trap s_endpgm Causes wavefront to be terminated.
3419 llvm.debugtrap *none* Compiler warning given that there is no
3420 trap handler installed.
3421 =============== =============== ===========================================
3431 When generating code for the OpenCL language the target triple environment
3432 should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3434 When the language is OpenCL the following differences occur:
3436 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3437 2. The AMDGPU backend adds additional arguments to the kernel.
3438 3. Additional metadata is generated
3439 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
3442 Specify what affect this has. Hidden arguments added. Additional metadata
3450 When generating code for the OpenCL language the target triple environment
3451 should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3453 When the language is OpenCL the following differences occur:
3455 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3458 Specify what affect this has.
3463 AMDGPU backend has LLVM-MC based assembler which is currently in development.
3464 It supports AMDGCN GFX6-GFX8.
3466 This section describes general syntax for instructions and operands. For more
3467 information about instructions, their semantics and supported combinations of
3468 operands, refer to one of instruction set architecture manuals
3469 [AMD-Souther-Islands]_, [AMD-Sea-Islands]_, [AMD-Volcanic-Islands]_ and
3472 An instruction has the following syntax (register operands are normally
3473 comma-separated while extra operands are space-separated):
3475 *<opcode> <register_operand0>, ... <extra_operand0> ...*
3480 The following syntax for register operands is supported:
3482 * SGPR registers: s0, ... or s[0], ...
3483 * VGPR registers: v0, ... or v[0], ...
3484 * TTMP registers: ttmp0, ... or ttmp[0], ...
3485 * Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3486 * Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3487 * Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3488 * Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3489 * Register index expressions: v[2*2], s[1-1:2-1]
3490 * 'off' indicates that an operand is not enabled
3492 The following extra operands are supported:
3494 * offset, offset0, offset1
3496 * glc, slc, tfe bits
3497 * waitcnt: integer or combination of counter values
3500 - abs (\| \|), neg (\-)
3504 - row_shl, row_shr, row_ror, row_rol
3505 - row_mirror, row_half_mirror, row_bcast
3506 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3507 - row_mask, bank_mask, bound_ctrl
3511 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3512 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3515 Instruction Examples
3516 ~~~~~~~~~~~~~~~~~~~~
3521 .. code-block:: nasm
3523 ds_add_u32 v2, v4 offset:16
3524 ds_write_src2_b64 v2 offset0:4 offset1:8
3525 ds_cmpst_f32 v2, v4, v6
3526 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3529 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3534 .. code-block:: nasm
3536 flat_load_dword v1, v[3:4]
3537 flat_store_dwordx3 v[3:4], v[5:7]
3538 flat_atomic_swap v1, v[3:4], v5 glc
3539 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3540 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3542 For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3547 .. code-block:: nasm
3549 buffer_load_dword v1, off, s[4:7], s1
3550 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3551 buffer_store_format_xy v[1:2], off, s[4:7], s1
3553 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3555 For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3560 .. code-block:: nasm
3562 s_load_dword s1, s[2:3], 0xfc
3563 s_load_dwordx8 s[8:15], s[2:3], s4
3564 s_load_dwordx16 s[88:103], s[2:3], s4
3568 For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3573 .. code-block:: nasm
3576 s_mov_b64 s[0:1], 0x80000000
3578 s_wqm_b64 s[2:3], s[4:5]
3579 s_bcnt0_i32_b64 s1, s[2:3]
3580 s_swappc_b64 s[2:3], s[4:5]
3581 s_cbranch_join s[4:5]
3583 For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3588 .. code-block:: nasm
3590 s_add_u32 s1, s2, s3
3591 s_and_b64 s[2:3], s[4:5], s[6:7]
3592 s_cselect_b32 s1, s2, s3
3593 s_andn2_b32 s2, s4, s6
3594 s_lshr_b64 s[2:3], s[4:5], s6
3595 s_ashr_i32 s2, s4, s6
3596 s_bfm_b64 s[2:3], s4, s6
3597 s_bfe_i64 s[2:3], s[4:5], s6
3598 s_cbranch_g_fork s[4:5], s[6:7]
3600 For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3605 .. code-block:: nasm
3608 s_bitcmp1_b32 s1, s2
3609 s_bitcmp0_b64 s[2:3], s4
3612 For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3617 .. code-block:: nasm
3622 s_waitcnt 0 ; Wait for all counters to be 0
3623 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3624 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3628 s_sendmsg sendmsg(MSG_INTERRUPT)
3631 For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3633 Unless otherwise mentioned, little verification is performed on the operands
3634 of SOPP Instructions, so it is up to the programmer to be familiar with the
3635 range or acceptable values.
3640 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3641 the assembler will automatically use optimal encoding based on its operands.
3642 To force specific encoding, one can add a suffix to the opcode of the instruction:
3644 * _e32 for 32-bit VOP1/VOP2/VOPC
3645 * _e64 for 64-bit VOP3
3647 * _sdwa for VOP_SDWA
3649 VOP1/VOP2/VOP3/VOPC examples:
3651 .. code-block:: nasm
3654 v_mov_b32_e32 v1, v2
3656 v_cvt_f64_i32_e32 v[1:2], v2
3657 v_floor_f32_e32 v1, v2
3658 v_bfrev_b32_e32 v1, v2
3659 v_add_f32_e32 v1, v2, v3
3660 v_mul_i32_i24_e64 v1, v2, 3
3661 v_mul_i32_i24_e32 v1, -3, v3
3662 v_mul_i32_i24_e32 v1, -100, v3
3663 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
3664 v_max_f16_e32 v1, v2, v3
3668 .. code-block:: nasm
3670 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
3671 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3672 v_mov_b32 v0, v0 wave_shl:1
3673 v_mov_b32 v0, v0 row_mirror
3674 v_mov_b32 v0, v0 row_bcast:31
3675 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
3676 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3677 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3681 .. code-block:: nasm
3683 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
3684 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
3685 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
3686 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
3687 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
3689 For full list of supported instructions, refer to "Vector ALU instructions".
3691 HSA Code Object Directives
3692 ~~~~~~~~~~~~~~~~~~~~~~~~~~
3694 AMDGPU ABI defines auxiliary data in output code object. In assembly source,
3695 one can specify them with assembler directives.
3697 .hsa_code_object_version major, minor
3698 +++++++++++++++++++++++++++++++++++++
3700 *major* and *minor* are integers that specify the version of the HSA code
3701 object that will be generated by the assembler.
3703 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
3704 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3707 *major*, *minor*, and *stepping* are all integers that describe the instruction
3708 set architecture (ISA) version of the assembly program.
3710 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
3711 "AMD" and *arch* should always be equal to "AMDGPU".
3713 By default, the assembler will derive the ISA version, *vendor*, and *arch*
3714 from the value of the -mcpu option that is passed to the assembler.
3716 .amdgpu_hsa_kernel (name)
3717 +++++++++++++++++++++++++
3719 This directives specifies that the symbol with given name is a kernel entry point
3720 (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
3725 This directive marks the beginning of a list of key / value pairs that are used
3726 to specify the amd_kernel_code_t object that will be emitted by the assembler.
3727 The list must be terminated by the *.end_amd_kernel_code_t* directive. For
3728 any amd_kernel_code_t values that are unspecified a default value will be
3729 used. The default value for all keys is 0, with the following exceptions:
3731 - *kernel_code_version_major* defaults to 1.
3732 - *machine_kind* defaults to 1.
3733 - *machine_version_major*, *machine_version_minor*, and
3734 *machine_version_stepping* are derived from the value of the -mcpu option
3735 that is passed to the assembler.
3736 - *kernel_code_entry_byte_offset* defaults to 256.
3737 - *wavefront_size* defaults to 6.
3738 - *kernarg_segment_alignment*, *group_segment_alignment*, and
3739 *private_segment_alignment* default to 4. Note that alignments are specified
3740 as a power of two, so a value of **n** means an alignment of 2^ **n**.
3742 The *.amd_kernel_code_t* directive must be placed immediately after the
3743 function label and before any instructions.
3745 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
3746 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
3748 Here is an example of a minimal amd_kernel_code_t specification:
3750 .. code-block:: none
3752 .hsa_code_object_version 1,0
3753 .hsa_code_object_isa
3758 .amdgpu_hsa_kernel hello_world
3763 enable_sgpr_kernarg_segment_ptr = 1
3765 compute_pgm_rsrc1_vgprs = 0
3766 compute_pgm_rsrc1_sgprs = 0
3767 compute_pgm_rsrc2_user_sgpr = 2
3768 kernarg_segment_byte_size = 8
3769 wavefront_sgpr_count = 2
3770 workitem_vgpr_count = 3
3771 .end_amd_kernel_code_t
3773 s_load_dwordx2 s[0:1], s[0:1] 0x0
3774 v_mov_b32 v0, 3.14159
3775 s_waitcnt lgkmcnt(0)
3778 flat_store_dword v[1:2], v0
3781 .size hello_world, .Lfunc_end0-hello_world
3783 Additional Documentation
3784 ========================
3786 .. [AMD-R6xx] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
3787 .. [AMD-R7xx] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
3788 .. [AMD-Evergreen] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
3789 .. [AMD-Cayman-Trinity] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
3790 .. [AMD-Souther-Islands] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
3791 .. [AMD-Sea-Islands] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
3792 .. [AMD-Volcanic-Islands] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
3793 .. [AMD-Vega] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
3794 .. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
3795 .. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
3796 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
3797 .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
3798 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
3799 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
3800 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
3801 .. [YAML] `YAML Ain’t Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
3802 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
3803 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
3804 .. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__