=============================
User Guide for AMDGPU Backend
=============================
The AMDGPU backend provides ISA code generation for AMD GPUs, from the R600
family through the current GCN families. It lives in the
``lib/Target/AMDGPU`` directory.
18 .. _amdgpu-target-triples:
23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24 specify the target triple:
.. table:: AMDGPU Architectures
   :name: amdgpu-architecture-table

   ============ ==============================================================
   Architecture Description
   ============ ==============================================================
   ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
   ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
   ============ ==============================================================
.. table:: AMDGPU Vendors
   :name: amdgpu-vendor-table

   ============ ==============================================================
   Vendor       Description
   ============ ==============================================================
   ``amd``      Can be used for all AMD GPU usage.
   ``mesa3d``   Can be used if the OS is ``mesa3d``.
   ============ ==============================================================
46 .. table:: AMDGPU Operating Systems
47 :name: amdgpu-os-table
49 ============== ============================================================
51 ============== ============================================================
52 *<empty>* Defaults to the *unknown* OS.
53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
54 such as AMD's ROCm [AMD-ROCm]_.
55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
59 ============== ============================================================
61 .. table:: AMDGPU Environments
62 :name: amdgpu-environment-table
64 ============ ==============================================================
65 Environment Description
66 ============ ==============================================================
68 ============ ==============================================================
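
As an illustration of the four-part triple syntax, the following Python sketch
splits a triple string into its components. The ``split_triple`` helper and the
``Triple`` type are hypothetical names introduced here; this is not how clang
parses triples, and the sketch does not validate the fields against the tables
above.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    architecture: str
    vendor: str
    os: str
    environment: str


def split_triple(triple: str) -> Triple:
    """Split '<Architecture>-<Vendor>-<OS>-<Environment>' into its fields.

    Trailing components that are not given are returned as empty strings.
    """
    parts = triple.split("-", 3)
    parts += [""] * (4 - len(parts))
    return Triple(*parts)


print(split_triple("amdgcn-amd-amdhsa"))
```

An omitted environment simply comes back as an empty string, which matches the
way the code object target string for ``amdhsa`` leaves that position empty.
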
70 .. _amdgpu-processors:
Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor.
Names from either the *Processor* or the *Alternative Processor* column can be
used.
78 .. table:: AMDGPU Processors
79 :name: amdgpu-processor-table
81 =========== =============== ============ ===== ========= ======= ==================
82 Processor Alternative Target dGPU/ Target ROCm Example
83 Processor Triple APU Features Support Products
84 Architecture Supported
86 =========== =============== ============ ===== ========= ======= ==================
87 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
88 -----------------------------------------------------------------------------------
89 ``r600`` ``r600`` dGPU
90 ``r630`` ``r600`` dGPU
91 ``rs880`` ``r600`` dGPU
92 ``rv670`` ``r600`` dGPU
93 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
94 -----------------------------------------------------------------------------------
95 ``rv710`` ``r600`` dGPU
96 ``rv730`` ``r600`` dGPU
97 ``rv770`` ``r600`` dGPU
98 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
99 -----------------------------------------------------------------------------------
100 ``cedar`` ``r600`` dGPU
101 ``cypress`` ``r600`` dGPU
102 ``juniper`` ``r600`` dGPU
103 ``redwood`` ``r600`` dGPU
104 ``sumo`` ``r600`` dGPU
105 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
106 -----------------------------------------------------------------------------------
107 ``barts`` ``r600`` dGPU
108 ``caicos`` ``r600`` dGPU
109 ``cayman`` ``r600`` dGPU
110 ``turks`` ``r600`` dGPU
111 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
112 -----------------------------------------------------------------------------------
113 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
114 ``gfx601`` - ``hainan`` ``amdgcn`` dGPU
118 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
119 -----------------------------------------------------------------------------------
120 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
130 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
134 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
138 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
139 - ``mullins`` - E1-2200
147 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
152 -----------------------------------------------------------------------------------
153 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
159 \ ``amdgcn`` APU - xnack ROCm - A10-8700P
162 \ ``amdgcn`` APU - xnack - A10-9600P
168 \ ``amdgcn`` APU - xnack - E2-9010
171 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
172 - ``tonga`` [off] - FirePro S7100
179 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
180 [off] - Radeon R9 Fury
184 - Radeon Instinct MI8
185 \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
186 [off] - Radeon RX 480
187 - Radeon Instinct MI6
188 \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
190 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
192 **GCN GFX9** [AMD-GCN-GFX9]_
193 -----------------------------------------------------------------------------------
194 ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
195 [off] Frontier Edition
200 - Radeon Instinct MI25
201 ``gfx902`` ``amdgcn`` APU - xnack - Ryzen 3 2200G
203 ``gfx904`` ``amdgcn`` dGPU - xnack *TBA*
208 ``gfx906`` ``amdgcn`` dGPU - xnack *TBA*
213 =========== =============== ============ ===== ========= ======= ==================
215 .. _amdgpu-target-features:
220 Target features control how code is generated to support certain
221 processor specific features. Not all target features are supported by
222 all processors. The runtime must ensure that the features supported by
223 the device used to execute the code match the features enabled when
224 generating the code. A mismatch of features may result in incorrect
225 execution, or a reduction in performance.
The target features supported by each processor, and the default value
used if not specified explicitly, are listed in
:ref:`amdgpu-processor-table`.
Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
target features.

For example:

``-mxnack``
  Enable the ``xnack`` feature.

``-mno-xnack``
  Disable the ``xnack`` feature.
.. table:: AMDGPU Target Features
   :name: amdgpu-target-feature-table

   ============== ==================================================
   Target Feature Description
   ============== ==================================================
   -m[no-]xnack   Enable/disable generating code that has
                  memory clauses that are compatible with
                  having XNACK replay enabled.

                  This is used for demand paging and page
                  migration. If XNACK replay is enabled in
                  the device, then if a page fault occurs
                  the code may execute incorrectly if the
                  ``xnack`` feature is not enabled. Executing
                  code that has the feature enabled on a
                  device that does not have XNACK replay
                  enabled will execute correctly, but may
                  be less performant than code with the
                  feature disabled.
   ============== ==================================================
263 .. _amdgpu-address-spaces:
268 The AMDGPU backend uses the following address space mappings.
The memory space names used in the table, aside from the region memory space, are
from the OpenCL standard.

The LLVM address space number is used throughout LLVM (for example, in LLVM IR).
275 .. table:: Address Space Mapping
276 :name: amdgpu-address-space-mapping-table
278 ================== =================
279 LLVM Address Space Memory Space
280 ================== =================
288 ================== =================
290 .. _amdgpu-memory-scopes:
295 This section provides LLVM memory synchronization scopes supported by the AMDGPU
296 backend memory model when the target triple OS is ``amdhsa`` (see
297 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
The memory model supported is based on the HSA memory model [HSA]_ which is
based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
relation is transitive over the synchronizes-with relation independent of scope,
and synchronizes-with allows the memory scope instances to be inclusive (see
table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).

This is different from the OpenCL [OpenCL]_ memory model which does not have scope
inclusion and requires the memory scopes to exactly match. However, this
is conservatively correct for OpenCL.
.. table:: AMDHSA LLVM Sync Scopes
   :name: amdgpu-amdhsa-llvm-sync-scopes-table

   ================ ==========================================================
   LLVM Sync Scope  Description
   ================ ==========================================================
   *none*           The default: ``system``.

                    Synchronizes with, and participates in modification and
                    seq_cst total orderings with, other operations (except
                    image operations) for all address spaces (except private,
                    or generic that accesses private) provided the other
                    operation's sync scope is:

                    - ``system``.
                    - ``agent`` and executed by a thread on the same agent.
                    - ``workgroup`` and executed by a thread in the same
                      workgroup.
                    - ``wavefront`` and executed by a thread in the same
                      wavefront.

   ``agent``        Synchronizes with, and participates in modification and
                    seq_cst total orderings with, other operations (except
                    image operations) for all address spaces (except private,
                    or generic that accesses private) provided the other
                    operation's sync scope is:

                    - ``system`` or ``agent`` and executed by a thread on the
                      same agent.
                    - ``workgroup`` and executed by a thread in the same
                      workgroup.
                    - ``wavefront`` and executed by a thread in the same
                      wavefront.

   ``workgroup``    Synchronizes with, and participates in modification and
                    seq_cst total orderings with, other operations (except
                    image operations) for all address spaces (except private,
                    or generic that accesses private) provided the other
                    operation's sync scope is:

                    - ``system``, ``agent`` or ``workgroup`` and executed by a
                      thread in the same workgroup.
                    - ``wavefront`` and executed by a thread in the same
                      wavefront.

   ``wavefront``    Synchronizes with, and participates in modification and
                    seq_cst total orderings with, other operations (except
                    image operations) for all address spaces (except private,
                    or generic that accesses private) provided the other
                    operation's sync scope is:

                    - ``system``, ``agent``, ``workgroup`` or ``wavefront``
                      and executed by a thread in the same wavefront.

   ``singlethread`` Only synchronizes with, and participates in modification
                    and seq_cst total orderings with, other operations (except
                    image operations) running in the same thread for all
                    address spaces (for example, in signal handlers).
   ================ ==========================================================
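
The scope inclusion rule in the table can be read as: two operations may
synchronize when both threads belong to the same instance of the narrower of
the two sync scopes. The Python sketch below encodes that reading only. The
``Thread`` tuple, the helper names and the integer identifiers are hypothetical,
and the sketch ignores the address space and image operation exclusions listed
in the table.

```python
from typing import NamedTuple

# Sync scopes ordered from narrowest to widest.
SCOPES = ("singlethread", "wavefront", "workgroup", "agent", "system")

# Number of leading Thread fields that must match to share a scope instance.
DEPTH = {"system": 0, "agent": 1, "workgroup": 2, "wavefront": 3, "singlethread": 4}


class Thread(NamedTuple):
    agent: int
    workgroup: int
    wavefront: int
    lane: int


def shares_scope_instance(scope: str, a: Thread, b: Thread) -> bool:
    """True if both threads are in the same instance of ``scope``."""
    depth = DEPTH[scope]
    return a[:depth] == b[:depth]


def may_synchronize(scope_a: str, a: Thread, scope_b: str, b: Thread) -> bool:
    """Apply scope inclusion: compare instances of the narrower scope."""
    scope_a = scope_a or "system"  # an omitted sync scope defaults to system
    scope_b = scope_b or "system"
    narrower = min(scope_a, scope_b, key=SCOPES.index)
    return shares_scope_instance(narrower, a, b)


t0 = Thread(agent=0, workgroup=0, wavefront=0, lane=0)
t1 = Thread(agent=0, workgroup=1, wavefront=0, lane=0)
print(may_synchronize("system", t0, "agent", t1))
print(may_synchronize("workgroup", t0, "agent", t1))
```

Under the OpenCL model the two scopes would instead have to match exactly,
which is why the inclusive rule is conservatively correct for OpenCL.
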
372 The AMDGPU backend implements the following LLVM IR intrinsics.
374 *This section is WIP.*
377 List AMDGPU intrinsics
382 The AMDGPU backend supports the following LLVM IR attributes.
.. table:: AMDGPU LLVM IR Attributes
   :name: amdgpu-llvm-ir-attributes-table

   ======================================= ==========================================================
   LLVM Attribute                          Description
   ======================================= ==========================================================
   "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
                                           will be specified when the kernel is dispatched. Generated
                                           by the ``amdgpu_flat_work_group_size`` Clang attribute [CLANG-ATTR]_.
   "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
                                           argument block size for the implicit arguments. This
                                           varies by OS and language (for OpenCL see
                                           :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
   "amdgpu-max-work-group-size"="n"        Specify the maximum work-group size that will be specified
                                           when the kernel is dispatched.
   "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
                                           the ``amdgpu_num_sgpr`` Clang attribute [CLANG-ATTR]_.
   "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
                                           ``amdgpu_num_vgpr`` Clang attribute [CLANG-ATTR]_.
   "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
                                           execution unit. Generated by the ``amdgpu_waves_per_eu``
                                           Clang attribute [CLANG-ATTR]_.
   ======================================= ==========================================================
411 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
412 can be linked by ``lld`` to produce a standard ELF shared code object which can
413 be loaded and executed on an AMDGPU target.
418 The AMDGPU backend uses the following ELF header:
420 .. table:: AMDGPU ELF Header
421 :name: amdgpu-elf-header-table
423 ========================== ===============================
425 ========================== ===============================
426 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
427 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
428 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
429 - ``ELFOSABI_AMDGPU_HSA``
430 - ``ELFOSABI_AMDGPU_PAL``
431 - ``ELFOSABI_AMDGPU_MESA3D``
432 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
433 - ``ELFABIVERSION_AMDGPU_PAL``
434 - ``ELFABIVERSION_AMDGPU_MESA3D``
435 ``e_type`` - ``ET_REL``
437 ``e_machine`` ``EM_AMDGPU``
439 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
440 ========================== ===============================
444 .. table:: AMDGPU ELF Header Enumeration Values
445 :name: amdgpu-elf-header-enumeration-values-table
447 =============================== =====
449 =============================== =====
452 ``ELFOSABI_AMDGPU_HSA`` 64
453 ``ELFOSABI_AMDGPU_PAL`` 65
454 ``ELFOSABI_AMDGPU_MESA3D`` 66
455 ``ELFABIVERSION_AMDGPU_HSA`` 1
456 ``ELFABIVERSION_AMDGPU_PAL`` 0
457 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
458 =============================== =====
460 ``e_ident[EI_CLASS]``
463 * ``ELFCLASS32`` for ``r600`` architecture.
465 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
469 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
471 ``e_ident[EI_OSABI]``
472 One of the following AMD GPU architecture specific OS ABIs
473 (see :ref:`amdgpu-os-table`):
475 * ``ELFOSABI_NONE`` for *unknown* OS.
477 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
479 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
* ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.
483 ``e_ident[EI_ABIVERSION]``
484 The ABI version of the AMD GPU architecture specific OS ABI to which the code
487 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
490 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
493 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
497 Can be one of the following values:
501 The type produced by the AMD GPU backend compiler as it is relocatable code
505 The type produced by the linker as it is a shared code object.
507 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
510 The value ``EM_AMDGPU`` is used for the machine for all processors supported
511 by the ``r600`` and ``amdgcn`` architectures (see
512 :ref:`amdgpu-processor-table`). The specific processor is specified in the
513 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
514 :ref:`amdgpu-elf-header-e_flags-table`).
517 The entry point is 0 as the entry points for individual kernels must be
518 selected in order to invoke them through AQL packets.
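
To make the ``e_ident`` fields concrete, the following Python sketch decodes the
class, data encoding, OS ABI and ABI version bytes of an ELF identification
block. The AMDGPU OS ABI numbers come from the enumeration table above. The byte
offsets, ``ELFOSABI_NONE`` (0), ``ELFCLASS32``/``ELFCLASS64`` (1/2) and
``ELFDATA2LSB`` (1) are the standard ELF values. The function name and the
sample bytes are hypothetical.

```python
import struct

# OS ABI values from the AMDGPU ELF header enumeration table;
# ELFOSABI_NONE and the class/data values are standard ELF constants.
ELFOSABI_NAMES = {
    0: "ELFOSABI_NONE",
    64: "ELFOSABI_AMDGPU_HSA",
    65: "ELFOSABI_AMDGPU_PAL",
    66: "ELFOSABI_AMDGPU_MESA3D",
}
ELFCLASS_NAMES = {1: "ELFCLASS32", 2: "ELFCLASS64"}
ELFDATA2LSB = 1


def describe_e_ident(e_ident: bytes) -> dict:
    """Decode the class, data encoding, OS ABI and ABI version bytes."""
    if len(e_ident) < 16 or e_ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF identification block")
    # Bytes 4..8: EI_CLASS, EI_DATA, EI_VERSION, EI_OSABI, EI_ABIVERSION.
    ei_class, ei_data, _ei_version, ei_osabi, ei_abiversion = struct.unpack_from(
        "5B", e_ident, 4
    )
    return {
        "class": ELFCLASS_NAMES.get(ei_class, "unknown"),
        "little_endian": ei_data == ELFDATA2LSB,
        "osabi": ELFOSABI_NAMES.get(ei_osabi, "unknown"),
        "abiversion": ei_abiversion,
    }


# Hypothetical identification bytes for an amdgcn code object on amdhsa.
sample = b"\x7fELF" + bytes([2, 1, 1, 64, 1]) + bytes(7)
print(describe_e_ident(sample))
```
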
521 The AMDGPU backend uses the following ELF header flags:
523 .. table:: AMDGPU ELF Header ``e_flags``
524 :name: amdgpu-elf-header-e_flags-table
526 ================================= ========== =============================
527 Name Value Description
528 ================================= ========== =============================
529 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`.
530 -------------------------------------------- -----------------------------
531 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
533 ``EF_AMDGPU_MACH_xxx`` values
535 :ref:`amdgpu-ef-amdgpu-mach-table`.
536 ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
539 contained in the code object.
546 :ref:`amdgpu-target-features`.
547 ================================= ========== =============================
549 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
550 :name: amdgpu-ef-amdgpu-mach-table
552 ================================= ========== =============================
553 Name Value Description (see
554 :ref:`amdgpu-processor-table`)
555 ================================= ========== =============================
556 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
557 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
558 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
559 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
560 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
561 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
562 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
563 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
564 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
565 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
566 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
567 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
568 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
569 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
570 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
571 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
572 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
573 *reserved* 0x011 - Reserved for ``r600``
574 0x01f architecture processors.
575 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
576 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
577 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
578 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
579 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
580 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
581 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
582 *reserved* 0x027 Reserved.
583 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
584 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
585 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
586 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
587 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
588 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
589 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
590 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
591 *reserved* 0x030 Reserved.
592 ================================= ========== =============================
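
As a worked example of the ``e_flags`` layout, the Python sketch below extracts
the ``EF_AMDGPU_MACH`` bit field and the ``EF_AMDGPU_XNACK`` bit from a flags
word. The masks and processor values are taken from the two tables above, and
only the ``amdgcn`` processors are listed for brevity. The function name and the
sample flag values are hypothetical.

```python
EF_AMDGPU_MACH = 0x000000FF   # processor selection bit field
EF_AMDGPU_XNACK = 0x00000100  # xnack target feature bit

# Subset of EF_AMDGPU_MACH_AMDGCN_* values from the table above.
AMDGCN_MACH_NAMES = {
    0x020: "gfx600", 0x021: "gfx601",
    0x022: "gfx700", 0x023: "gfx701", 0x024: "gfx702",
    0x025: "gfx703", 0x026: "gfx704",
    0x028: "gfx801", 0x029: "gfx802", 0x02A: "gfx803", 0x02B: "gfx810",
    0x02C: "gfx900", 0x02D: "gfx902", 0x02E: "gfx904", 0x02F: "gfx906",
}


def decode_e_flags(e_flags: int):
    """Return (processor name or None, xnack enabled) for an e_flags word."""
    mach = e_flags & EF_AMDGPU_MACH
    xnack = bool(e_flags & EF_AMDGPU_XNACK)
    return AMDGCN_MACH_NAMES.get(mach), xnack


print(decode_e_flags(0x12D))
print(decode_e_flags(0x02C))
```
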
597 An AMDGPU target ELF code object has the standard ELF sections which include:
599 .. table:: AMDGPU ELF Sections
600 :name: amdgpu-elf-sections-table
602 ================== ================ =================================
604 ================== ================ =================================
605 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
606 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
607 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
608 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
609 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
610 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
611 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
612 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
613 ``.note`` ``SHT_NOTE`` *none*
614 ``.rela``\ *name* ``SHT_RELA`` *none*
615 ``.rela.dyn`` ``SHT_RELA`` *none*
616 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
617 ``.shstrtab`` ``SHT_STRTAB`` *none*
618 ``.strtab`` ``SHT_STRTAB`` *none*
619 ``.symtab`` ``SHT_SYMTAB`` *none*
620 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
621 ================== ================ =================================
623 These sections have their standard meanings (see [ELF]_) and are only generated
627 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
628 DWARF produced by the AMDGPU backend.
630 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
631 The standard sections used by a dynamic loader.
634 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
637 ``.rela``\ *name*, ``.rela.dyn``
638 For relocatable code objects, *name* is the name of the section that the
639 relocation records apply. For example, ``.rela.text`` is the section name for
640 relocation records associated with the ``.text`` section.
642 For linked shared code objects, ``.rela.dyn`` contains all the relocation
643 records from each of the relocatable code object's ``.rela``\ *name* sections.
645 See :ref:`amdgpu-relocation-records` for the relocation records supported by
The executable machine code for the kernels and functions they call. Generated
as position-independent code. See :ref:`amdgpu-code-conventions` for
information on conventions used in the ISA generation.
653 .. _amdgpu-note-records:
As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
aligned. In addition, minimal zero byte padding must be generated to ensure the
``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
``.note`` section must be at least 4 to indicate at least 4 byte alignment.
664 The AMDGPU backend code object uses the following ELF note records in the
665 ``.note`` section. The *Description* column specifies the layout of the note
666 record's ``desc`` field. All fields are consecutive bytes. Note records with
667 variable size strings have a corresponding ``*_size`` field that specifies the
668 number of bytes, including the terminating null character, in the string. The
669 string(s) come immediately after the preceding fields.
671 Additional note records can be present.
673 .. table:: AMDGPU ELF Note Records
674 :name: amdgpu-elf-note-records-table
676 ===== ============================== ======================================
677 Name Type Description
678 ===== ============================== ======================================
679 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
680 ===== ============================== ======================================
684 .. table:: AMDGPU ELF Note Record Enumeration Values
685 :name: amdgpu-elf-note-record-enumeration-values-table
687 ============================== =====
689 ============================== =====
691 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
693 ============================== =====
695 ``NT_AMD_AMDGPU_HSA_METADATA``
696 Specifies extensible metadata associated with the code objects executed on HSA
697 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
698 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
699 :ref:`amdgpu-amdhsa-code-object-metadata` for the syntax of the code
700 object metadata string.
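
The padding rules for note records can be sketched in Python as follows. The
sketch assumes the standard ELF note layout, which is not spelled out in this
section: three little-endian 32-bit words (name size, desc size, type) followed
by the ``name`` and ``desc`` fields. The helper names and the placeholder
``desc`` bytes are hypothetical; the note name ``"AMD"`` and the type value 10
come from the tables above.

```python
import struct

NT_AMD_AMDGPU_HSA_METADATA = 10


def pad4(data: bytes) -> bytes:
    """Append the minimal number of zero bytes to reach a multiple of 4."""
    return data + b"\0" * (-len(data) % 4)


def build_note(name: str, note_type: int, desc: bytes) -> bytes:
    """Encode one note record with 4 byte aligned name and desc fields."""
    name_bytes = name.encode("ascii") + b"\0"  # size includes the terminator
    header = struct.pack("<III", len(name_bytes), len(desc), note_type)
    return header + pad4(name_bytes) + pad4(desc)


# Placeholder bytes standing in for a null terminated metadata string.
note = build_note("AMD", NT_AMD_AMDGPU_HSA_METADATA, b"abcde\0")
print(len(note), note.hex())
```

The size words record the unpadded lengths, so a reader has to round each one
up to a multiple of 4 to find the start of the next field or record.
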
707 Symbols include the following:
709 .. table:: AMDGPU ELF Symbols
710 :name: amdgpu-elf-symbols-table
712 ===================== ============== ============= ==================
713 Name Type Section Description
714 ===================== ============== ============= ==================
715 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
718 *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
719 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
720 ===================== ============== ============= ==================
723 Global variables both used and defined by the compilation unit.
If the symbol is defined in the compilation unit then it is allocated in the
appropriate section according to whether it has initialized data or is read-only.
728 If the symbol is external then its section is ``STN_UNDEF`` and the loader
729 will resolve relocations using the definition provided by another code object
730 or explicitly defined by the runtime.
732 All global symbols, whether defined in the compilation unit or external, are
733 accessed by the machine code indirectly through a GOT table entry. This
734 allows them to be preemptable. The GOT table is only supported when the target
735 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
738 Add description of linked shared object symbols. Seems undefined symbols
739 are marked as STT_NOTYPE.
742 Every HSA kernel has an associated kernel descriptor. It is the address of the
743 kernel descriptor that is used in the AQL dispatch packet used to invoke the
744 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
745 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
748 Every HSA kernel also has a symbol for its machine code entry point.
750 .. _amdgpu-relocation-records:
The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
relocatable fields are:

``word32``
  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMD GPU architecture.

``word64``
  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMD GPU architecture.

The following notation is used for specifying relocation calculations:

**A**
  Represents the addend used to compute the value of the relocatable field.

**G**
  Represents the offset into the global offset table at which the relocation
  entry's symbol will reside during execution.

**GOT**
  Represents the address of the global offset table.

**P**
  Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
  of the storage unit being relocated (computed using ``r_offset``).

**S**
  Represents the value of the symbol whose index resides in the relocation
  entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.

**B**
  Represents the base address of a loaded executable or shared object which is
  the difference between the ELF address and the actual load address. Relocations
  using this are only valid in executable or shared objects.
793 The following relocation types are supported:
795 .. table:: AMDGPU ELF Relocation Records
796 :name: amdgpu-elf-relocation-records-table
798 ========================== ======= ===== ========== ==============================
799 Relocation Type Kind Value Field Calculation
800 ========================== ======= ===== ========== ==============================
801 ``R_AMDGPU_NONE`` 0 *none* *none*
802 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF
804 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32
806 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A
808 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
809 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
810 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A
812 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
813 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
814 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
815 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
816 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
818 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
819 ========================== ======= ===== ========== ==============================
821 ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
822 the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
824 There is no current OS loader support for 32 bit programs and so
825 ``R_AMDGPU_ABS32`` is not used.
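
The calculations in the relocation table can be written out as a small Python
sketch. The formulas are copied from the table; the function name, the keyword
arguments and the sample addresses are hypothetical. As a simplifying
assumption, results are truncated to the field width (two's complement), and
any overflow checking a real linker may perform is not modeled.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def relocation_value(rel_type, S=0, A=0, P=0, G=0, GOT=0, B=0):
    """Return the value stored in the relocatable field for ``rel_type``."""
    # Each entry is (calculation from the table, mask for the field width).
    formulas = {
        "R_AMDGPU_ABS32_LO": (S + A, MASK32),
        "R_AMDGPU_ABS32_HI": ((S + A) >> 32, MASK32),
        "R_AMDGPU_ABS64": (S + A, MASK64),
        "R_AMDGPU_REL32": (S + A - P, MASK32),
        "R_AMDGPU_REL64": (S + A - P, MASK64),
        "R_AMDGPU_ABS32": (S + A, MASK32),
        "R_AMDGPU_GOTPCREL": (G + GOT + A - P, MASK32),
        "R_AMDGPU_GOTPCREL32_LO": (G + GOT + A - P, MASK32),
        "R_AMDGPU_GOTPCREL32_HI": ((G + GOT + A - P) >> 32, MASK32),
        "R_AMDGPU_REL32_LO": (S + A - P, MASK32),
        "R_AMDGPU_REL32_HI": ((S + A - P) >> 32, MASK32),
        "R_AMDGPU_RELATIVE64": (B + A, MASK64),
    }
    value, mask = formulas[rel_type]
    return value & mask


# A backward PC-relative reference: the symbol lies below the place.
lo = relocation_value("R_AMDGPU_REL32_LO", S=0x1000, A=4, P=0x2000)
hi = relocation_value("R_AMDGPU_REL32_HI", S=0x1000, A=4, P=0x2000)
print(hex(lo), hex(hi))
```

The ``_LO``/``_HI`` pairs together carry a full 64-bit result in two 32-bit
fields, so a negative displacement yields an all-ones high word.
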
832 Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
833 information that maps the code object executable code and data to the source
834 language constructs. It can be used by tools such as debuggers and profilers.
836 Address Space Mapping
837 ~~~~~~~~~~~~~~~~~~~~~
839 The following address space mapping is used:
841 .. table:: AMDGPU DWARF Address Space Mapping
842 :name: amdgpu-dwarf-address-space-mapping-table
844 =================== =================
845 DWARF Address Space Memory Space
846 =================== =================
851 *omitted* Generic (Flat)
852 *not supported* Region (GDS)
853 =================== =================
855 See :ref:`amdgpu-address-spaces` for information on the memory space terminology
858 An ``address_class`` attribute is generated on pointer type DIEs to specify the
859 DWARF address space of the value of the pointer when it is in the *private* or
860 *local* address space. Otherwise the attribute is omitted.
An ``XDEREF`` operation is generated in location list expressions for variables
that are allocated in the *private* and *local* address space. Otherwise no
``XDEREF`` is generated.
869 *This section is WIP.*
872 Define DWARF register enumeration.
If a wavefront state is to be presented, vector registers should be exposed as
64 wide (rather than the per work-item view that LLVM uses), either as separate
registers or as a single 64x4 byte register. In either case, use a new LANE op
(akin to XDEREF) to select the current lane usage in a location
expression. This would also allow scalar register spilling to vector register
lanes to be expressed (currently no debug information is being generated for
spilling). If the wide single register approach is chosen, use LANE in
conjunction with the PIECE operation to select the dword part of the register for
the current lane. If the separate register approach is chosen, use LANE to select
888 Source text for online-compiled programs (e.g. those compiled by the OpenCL
889 runtime) may be embedded into the DWARF v5 line table using the ``clang
890 -gembed-source`` option, described in table :ref:`amdgpu-debug-options`.

``-gembed-source``
  Enable the embedded source DWARF v5 extension.
``-gno-embed-source``
  Disable the embedded source DWARF v5 extension.

899 .. table:: AMDGPU Debug Options
900 :name: amdgpu-debug-options
902 ==================== ==================================================
903 Debug Flag Description
904 ==================== ==================================================
905 -g[no-]embed-source Enable/disable embedding source text in DWARF
906 debug sections. Useful for environments where
907 source cannot be written to disk, such as
908 when performing online compilation.
909 ==================== ==================================================
This option enables one extended content type in the DWARF v5 Line Number
Program Header, which is used to encode embedded source.
914 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
915 :name: amdgpu-dwarf-extended-content-types
917 ============================ ======================
919 ============================ ======================
920 ``DW_LNCT_LLVM_source`` ``DW_FORM_line_strp``
921 ============================ ======================
923 The source field will contain the UTF-8 encoded, null-terminated source text
924 with ``'\n'`` line endings. When the source field is present, consumers can use
925 the embedded source instead of attempting to discover the source on disk. When
926 the source field is absent, consumers can access the file to get the source
The above content type appears in the ``file_name_entry_format`` field of the
line table prologue, and its corresponding value appears in the ``file_names``
field. The current encoding of the content type is documented in table
:ref:`amdgpu-dwarf-extended-content-types-encoding`.
934 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
935 :name: amdgpu-dwarf-extended-content-types-encoding
937 ============================ ====================
939 ============================ ====================
940 ``DW_LNCT_LLVM_source`` 0x2001
941 ============================ ====================
943 .. _amdgpu-code-conventions:
948 This section provides code conventions used for each supported target triple OS
949 (see :ref:`amdgpu-target-triples`).
954 This section provides code conventions used when the target triple OS is
955 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
957 .. _amdgpu-amdhsa-code-object-target-identification:
959 Code Object Target Identification
960 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
962 The AMDHSA OS uses the following syntax to specify the code object
963 target as a single string:
965 ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
969 - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
970 are the same as the *Target Triple* (see
971 :ref:`amdgpu-target-triples`).
973 - ``<Processor>`` is the same as the *Processor* (see
974 :ref:`amdgpu-processors`).
976 - ``<Target Features>`` is a list of the enabled *Target Features*
977 (see :ref:`amdgpu-target-features`), each prefixed by a plus, that
978 apply to *Processor*. The list must be in the same order as listed
979 in the table :ref:`amdgpu-target-feature-table`. Note that *Target
980 Features* must be included in the list if they are enabled even if
981 that is the default for *Processor*.
985 ``"amdgcn-amd-amdhsa--gfx902+xnack"``
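The string syntax above can be illustrated with a minimal Python sketch. The
helper name ``code_object_target`` is hypothetical and not part of any AMD or
LLVM tool; the caller is assumed to pass the features already sorted in the
order required by the target feature table.

```python
def code_object_target(arch, vendor, os_name, env, processor, features=()):
    """Build a code object target string (hypothetical helper).

    `features` must already be in the order given by the target feature
    table; this sketch does not know that order and does not re-sort.
    """
    # The four triple components are always joined by '-', even when the
    # environment is empty (which yields the '--' seen in the example).
    triple = "-".join([arch, vendor, os_name, env])
    # Each enabled target feature is prefixed by a plus sign.
    return f"{triple}-{processor}" + "".join(f"+{name}" for name in features)


print(code_object_target("amdgcn", "amd", "amdhsa", "", "gfx902", ["xnack"]))
```

With an empty environment the result reproduces the example string shown
above.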
987 .. _amdgpu-amdhsa-code-object-metadata:
992 The code object metadata specifies extensible metadata associated with the code
993 objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
994 [AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
995 (see :ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the ROCm kernel queries, for example, the
segment sizes needed in a dispatch packet. In addition, a high level language
runtime may require other information to be included. For example, the AMD
OpenCL runtime records kernel argument information.
1002 The metadata is specified as a YAML formatted string (see [YAML]_ and
Is the string null terminated? It probably should not be if YAML allows it to
contain null characters, otherwise it should be.
1009 The metadata is represented as a single YAML document comprised of the mapping
1010 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
1013 For boolean values, the string values of ``false`` and ``true`` are used for
1014 false and true respectively.
1016 Additional information can be added to the mappings. To avoid conflicts, any
1017 non-AMD key names should be prefixed by "*vendor-name*.".
1019 .. table:: AMDHSA Code Object Metadata Mapping
1020 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
1022 ========== ============== ========= =======================================
1023 String Key Value Type Required? Description
1024 ========== ============== ========= =======================================
1025 "Version" sequence of Required - The first integer is the major
1026 2 integers version. Currently 1.
1027 - The second integer is the minor
1028 version. Currently 0.
1029 "Printf" sequence of Each string is encoded information
1030 strings about a printf function call. The
1031 encoded information is organized as
1032 fields separated by colon (':'):
1034 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
1039 A 32 bit integer as a unique id for
1040 each printf function call
1043 A 32 bit integer equal to the number
1044 of arguments of printf function call
1047 ``S[i]`` (where i = 0, 1, ... , N-1)
1048 32 bit integers for the size in bytes
1049 of the i-th FormatString argument of
1050 the printf function call
1053 The format string passed to the
1054 printf function call.
1055 "Kernels" sequence of Required Sequence of the mappings for each
1056 mapping kernel in the code object. See
1057 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
1058 for the definition of the mapping.
1059 ========== ============== ========= =======================================
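As an illustration of the "Printf" encoding described in the table above, the
following Python sketch splits one entry into its fields. The function name
``parse_printf_entry`` and the sample string are hypothetical. The sketch
assumes the format string is the final field and may itself contain colons, so
only the first ``N + 2`` colons are treated as separators.

```python
def parse_printf_entry(entry):
    """Split 'ID:N:S[0]:...:S[N-1]:FormatString' into its fields."""
    # The first two fields are the unique id and the argument count.
    id_text, count_text, rest = entry.split(":", 2)
    count = int(count_text)
    # The next `count` fields are argument sizes in bytes. Whatever remains
    # is the format string, which is kept intact even if it contains ':'.
    parts = rest.split(":", count)
    if len(parts) != count + 1:
        raise ValueError(f"expected {count} argument sizes in {entry!r}")
    return {
        "id": int(id_text),
        "arg_sizes": [int(size) for size in parts[:count]],
        "format": parts[count],
    }


# Hypothetical entry: id 1, two arguments of 4 and 8 bytes.
print(parse_printf_entry("1:2:4:8:x=%d: y=%ld"))
```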
1063 .. table:: AMDHSA Code Object Kernel Metadata Mapping
1064 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
1066 ================= ============== ========= ================================
1067 String Key Value Type Required? Description
1068 ================= ============== ========= ================================
1069 "Name" string Required Source name of the kernel.
1070 "SymbolName" string Required Name of the kernel
1071 descriptor ELF symbol.
1072 "Language" string Source language of the kernel.
1080 "LanguageVersion" sequence of - The first integer is the major
1082 - The second integer is the
1084 "Attrs" mapping Mapping of kernel attributes.
1086 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
1087 for the mapping definition.
1088 "Args" sequence of Sequence of mappings of the
1089 mapping kernel arguments. See
1090 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
1091 for the definition of the mapping.
1092 "CodeProps" mapping Mapping of properties related to
1093 the kernel code. See
1094 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
1095 for the mapping definition.
1096 ================= ============== ========= ================================
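The "Required" columns of the two tables above can be checked mechanically. The
sketch below works on metadata that has already been loaded into plain Python
dictionaries and lists (the YAML parsing step is left out). The function name
``missing_required_keys`` and the sample kernel names are hypothetical.

```python
# Keys marked "Required" in the code object and kernel metadata tables.
REQUIRED_TOP_LEVEL = ("Version", "Kernels")
REQUIRED_PER_KERNEL = ("Name", "SymbolName")


def missing_required_keys(metadata):
    """Return a list of human-readable problems; empty if none were found."""
    problems = [
        f"missing top-level key {key!r}"
        for key in REQUIRED_TOP_LEVEL
        if key not in metadata
    ]
    for index, kernel in enumerate(metadata.get("Kernels", [])):
        problems += [
            f"kernel {index}: missing key {key!r}"
            for key in REQUIRED_PER_KERNEL
            if key not in kernel
        ]
    return problems


# Hypothetical metadata for a single kernel.
example = {
    "Version": [1, 0],
    "Kernels": [{"Name": "my_kernel", "SymbolName": "my_kernel@kd"}],
}
print(missing_required_keys(example))
```

Optional keys and vendor-prefixed keys are ignored by this check, which matches
the rule that additional information may be added to the mappings.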
1100 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
1101 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
1103 =================== ============== ========= ==============================
1104 String Key Value Type Required? Description
1105 =================== ============== ========= ==============================
1106 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
1107 3 integers must be >=1 and the dispatch
1108 work-group size X, Y, Z must
1109 correspond to the specified
1110 values. Defaults to 0, 0, 0.
1112 Corresponds to the OpenCL
1113 ``reqd_work_group_size``
1115 "WorkGroupSizeHint" sequence of The dispatch work-group size
1116 3 integers X, Y, Z is likely to be the
1119 Corresponds to the OpenCL
1120 ``work_group_size_hint``
1122 "VecTypeHint" string The name of a scalar or vector
1125 Corresponds to the OpenCL
1126 ``vec_type_hint`` attribute.
1128 "RuntimeHandle" string The external symbol name
1129 associated with a kernel.
1130 OpenCL runtime allocates a
1131 global buffer for the symbol
1132 and saves the kernel's address
1133 to it, which is used for
1134 device side enqueueing. Only
1135 available for device side
1137 =================== ============== ========= ==============================
1141 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1142 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1144 ================= ============== ========= ================================
1145 String Key Value Type Required? Description
1146 ================= ============== ========= ================================
1147 "Name" string Kernel argument name.
1148 "TypeName" string Kernel argument type name.
1149 "Size" integer Required Kernel argument size in bytes.
1150 "Align" integer Required Kernel argument alignment in
1151 bytes. Must be a power of two.
1152 "ValueKind" string Required Kernel argument kind that
1153 specifies how to set up the
1154 corresponding argument.
1158 The argument is copied
1159 directly into the kernarg.
1162 A global address space pointer
1163 to the buffer data is passed
1166 "DynamicSharedPointer"
1167 A group address space pointer
1168 to dynamically allocated LDS
1169 is passed in the kernarg.
1172 A global address space
1173 pointer to a S# is passed in
1177 A global address space
1178 pointer to a T# is passed in
1182 A global address space pointer
1183 to an OpenCL pipe is passed in
1187 A global address space pointer
1188 to an OpenCL device enqueue
1189 queue is passed in the
1192 "HiddenGlobalOffsetX"
1193 The OpenCL grid dispatch
1194 global offset for the X
1195 dimension is passed in the
1198 "HiddenGlobalOffsetY"
1199 The OpenCL grid dispatch
1200 global offset for the Y
1201 dimension is passed in the
1204 "HiddenGlobalOffsetZ"
1205 The OpenCL grid dispatch
1206 global offset for the Z
1207 dimension is passed in the
1211 An argument that is not used
1212 by the kernel. Space needs to
1213 be left for it, but it does
1214 not need to be set up.
1216 "HiddenPrintfBuffer"
1217 A global address space pointer
1218 to the runtime printf buffer
1219 is passed in kernarg.
1221 "HiddenDefaultQueue"
1222 A global address space pointer
1223 to the OpenCL device enqueue
1224 queue that should be used by
1225 the kernel by default is
1226 passed in the kernarg.
1228 "HiddenCompletionAction"
1229 A global address space pointer
1230 to help link enqueued kernels into
1231 the ancestor tree for determining
1232 when the parent kernel has finished.
1234 "ValueType" string Required Kernel argument value type. Only
1235 present if "ValueKind" is
1236 "ByValue". For vector data
1237 types, the value is for the
1238 element type. Values include:
1254 How can it be determined if a
1255 vector type, and what size
1257 "PointeeAlign" integer Alignment in bytes of pointee
1258 type for pointer type kernel
1259 argument. Must be a power
1260 of 2. Only present if
1262 "DynamicSharedPointer".
1263 "AddrSpaceQual" string Kernel argument address space
1264 qualifier. Only present if
1265 "ValueKind" is "GlobalBuffer" or
1266 "DynamicSharedPointer". Values
1277 Is GlobalBuffer only Global
1279 DynamicSharedPointer always
1280 Local? Can HCC allow Generic?
1281 How can Private or Region
1283 "AccQual" string Kernel argument access
1284 qualifier. Only present if
1285 "ValueKind" is "Image" or
1296 "ActualAccQual" string The actual memory accesses
1297 performed by the kernel on the
1298 kernel argument. Only present if
1299 "ValueKind" is "GlobalBuffer",
1300 "Image", or "Pipe". This may be
1301 more restrictive than indicated
1302 by "AccQual" to reflect what the
kernel actually does. If not
1304 present then the runtime must
1305 assume what is implied by
1306 "AccQual" and "IsConst". Values
1313 "IsConst" boolean Indicates if the kernel argument
1314 is const qualified. Only present
1318 "IsRestrict" boolean Indicates if the kernel argument
1319 is restrict qualified. Only
1320 present if "ValueKind" is
1323 "IsVolatile" boolean Indicates if the kernel argument
1324 is volatile qualified. Only
1325 present if "ValueKind" is
1328 "IsPipe" boolean Indicates if the kernel argument
1329 is pipe qualified. Only present
1330 if "ValueKind" is "Pipe".
1333 Can GlobalBuffer be pipe
1335 ================= ============== ========= ================================
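The "Size" and "Align" keys above are enough to sketch how argument offsets
could be derived. The Python below is only an illustration under a stated
assumption: each argument is placed at the next offset that satisfies its own
alignment, in declaration order. The authoritative layout is the one defined
by the HSA specification, and the helper name ``layout_kernargs`` is
hypothetical.

```python
def layout_kernargs(args):
    """Return (offsets, end_offset, max_align) for a list of argument mappings.

    Assumes natural placement: each argument starts at the next multiple of
    its "Align" value. No tail padding is added to end_offset.
    """
    offset = 0
    max_align = 1
    offsets = []
    for arg in args:
        size, align = arg["Size"], arg["Align"]
        # "Align" must be a power of two according to the table.
        if align <= 0 or align & (align - 1):
            raise ValueError(f"alignment {align} is not a power of two")
        # Round the running offset up to the argument's alignment.
        offset = (offset + align - 1) & ~(align - 1)
        offsets.append(offset)
        offset += size
        max_align = max(max_align, align)
    return offsets, offset, max_align


# Hypothetical argument list: an int, a pointer, and a char.
print(layout_kernargs([
    {"Size": 4, "Align": 4},
    {"Size": 8, "Align": 8},
    {"Size": 1, "Align": 1},
]))
```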
1339 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1340 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1342 ============================ ============== ========= =====================
1343 String Key Value Type Required? Description
1344 ============================ ============== ========= =====================
1345 "KernargSegmentSize" integer Required The size in bytes of
1347 that holds the values
1350 "GroupSegmentFixedSize" integer Required The amount of group
1354 bytes. This does not
1356 dynamically allocated
1357 group segment memory
1361 "PrivateSegmentFixedSize" integer Required The amount of fixed
1362 private address space
1363 memory required for a
1365 bytes. If the kernel
1367 stack then additional
1369 to this value for the
1371 "KernargSegmentAlign" integer Required The maximum byte
1374 kernarg segment. Must
1376 "WavefrontSize" integer Required Wavefront size. Must
1378 "NumSGPRs" integer Required Number of scalar
1382 includes the special
1388 SGPR added if a trap
1394 "NumVGPRs" integer Required Number of vector
1398 "MaxFlatWorkGroupSize" integer Required Maximum flat
1401 kernel in work-items.
1404 ReqdWorkGroupSize if
1406 "NumSpilledSGPRs" integer Number of stores from
1407 a scalar register to
1408 a register allocator
1411 "NumSpilledVGPRs" integer Number of stores from
1412 a vector register to
1413 a register allocator
1416 ============================ ============== ========= =====================
1423 The HSA architected queuing language (AQL) defines a user space memory interface
1424 that can be used to control the dispatch of kernels, in an agent independent
1425 way. An agent can have zero or more AQL queues created for it using the ROCm
1426 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1427 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1428 mechanics and packet layouts.
1430 The packet processor of a kernel agent is responsible for detecting and
1431 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1432 packet processor is implemented by the hardware command processor (CP),
1433 asynchronous dispatch controller (ADC) and shader processor input controller
1436 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1437 mode driver to initialize and register the AQL queue with CP.
1439 To dispatch a kernel the following actions are performed. This can occur in the
1440 CPU host program, or from an HSA kernel executing on a GPU.
1442 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1443 executed is obtained.
1444 2. A pointer to the kernel descriptor (see
1445 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
obtained. It must be for a kernel that is contained in a code object that
1447 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1449 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
1450 for a memory region with the kernarg property for the kernel agent that will
1451 execute the kernel. It must be at least 16 byte aligned.
1452 4. Kernel argument values are assigned to the kernel argument memory
1453 allocation. The layout is defined in the *HSA Programmer's Language Reference*
1454 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1455 memory in the same way constant memory is accessed. (Note that the HSA
1456 specification allows an implementation to copy the kernel argument contents to
1457 another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
API uses 64 bit atomic operations to reserve space in the AQL queue for the
packet. The packet must be set up, and the final write must use an atomic
store release to set the packet kind to ensure the packet contents are
visible to the kernel agent. AQL defines a doorbell signal mechanism to
notify the kernel agent that the AQL queue has been updated. These rules, and
the layout of the AQL queue and kernel dispatch packet, are defined in the *HSA
System Architecture Specification* [HSA]_.
1466 6. A kernel dispatch packet includes information about the actual dispatch,
1467 such as grid and work-group size, together with information from the code
1468 object about the kernel, such as segment sizes. The ROCm runtime queries on
1469 the kernel symbol can be used to obtain the code object values which are
1470 recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
1471 7. CP executes micro-code and is responsible for detecting and setting up the
1472 GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
1474 code, the scalar general purpose registers (SGPR) and vector general purpose
1475 registers (VGPR) are set up as required by the machine code. The required
1476 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1477 register state is defined in
1478 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
1479 9. The prolog of the kernel machine code (see
1480 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1481 before continuing executing the machine code that corresponds to the kernel.
1482 10. When the kernel dispatch has completed execution, CP signals the completion
1483 signal specified in the kernel dispatch packet if not 0.
1485 .. _amdgpu-amdhsa-memory-spaces:
1490 The memory space properties are:
1492 .. table:: AMDHSA Memory Spaces
1493 :name: amdgpu-amdhsa-memory-spaces-table
1495 ================= =========== ======== ======= ==================
1496 Memory Space Name HSA Segment Hardware Address NULL Value
1498 ================= =========== ======== ======= ==================
1499 Private private scratch 32 0x00000000
1500 Local group LDS 32 0xFFFFFFFF
1501 Global global global 64 0x0000000000000000
1502 Constant constant *same as 64 0x0000000000000000
1504 Generic flat flat 64 0x0000000000000000
1505 Region N/A GDS 32 *not implemented
1507 ================= =========== ======== ======= ==================
1509 The global and constant memory spaces both use global virtual addresses, which
1510 are the same virtual address space used by the CPU. However, some virtual
1511 addresses may only be accessible to the CPU, some only accessible by the GPU,
1514 Using the constant memory space indicates that the data will not change during
1515 the execution of the kernel. This allows scalar read instructions to be
1516 used. The vector and scalar L1 caches are invalidated of volatile data before
1517 each kernel dispatch execution to allow constant memory to change values between
1520 The local memory space uses the hardware Local Data Store (LDS) which is
1521 automatically allocated when the hardware creates work-groups of wavefronts, and
1522 freed when all the wavefronts of a work-group have terminated. The data store
1523 (DS) instructions can be used to access it.
1525 The private memory space uses the hardware scratch memory support. If the kernel
1526 uses scratch, then the hardware allocates memory that is accessed using
1527 wavefront lane dword (4 byte) interleaving. The mapping used from private
1528 address to physical address is:
1530 ``wavefront-scratch-base +
1531 (private-address * wavefront-size * 4) +
1532 (wavefront-lane-id * 4)``
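The mapping above can be written out as a small Python sketch. It implements
the expression literally as given. The function name and the default wavefront
size of 64 are assumptions made for illustration only.

```python
def scratch_physical_address(wavefront_scratch_base, private_address,
                             wavefront_lane_id, wavefront_size=64):
    """Evaluate the private-to-physical mapping exactly as written above."""
    return (wavefront_scratch_base
            + private_address * wavefront_size * 4
            + wavefront_lane_id * 4)


# Hypothetical scratch base. Lanes that use the same private address land on
# adjacent dwords, which is the interleaving described in the text.
base = 0x1000
for lane in range(4):
    print(lane, hex(scratch_physical_address(base, 0, lane)))
```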
1534 There are different ways that the wavefront scratch base address is determined
1535 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
1537 the scratch buffer descriptor and per wavefront scratch offset, by the scratch
1538 instructions, or by flat instructions. If each lane of a wavefront accesses the
1539 same private address, the interleaving results in adjacent dwords being accessed
1540 and hence requires fewer cache lines to be fetched. Multi-dword access is not
1541 supported except by flat and scratch instructions in GFX9.
1543 The generic address space uses the hardware flat address support available in
GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
local apertures), that are outside the range of addressable global memory, to
map from a flat address to a private or local address.
FLAT instructions can take a flat address and access global, private (scratch)
and group (LDS) memory depending on whether the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
1551 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1552 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1553 (see :ref:`amdgpu-amdhsa-m0`).
To convert between a segment address and a flat address the base address of the
aperture can be used. For GFX7-GFX8 the bases are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
GFX9 the aperture base addresses are directly available as inline constant
registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
which makes it easier to convert from flat to segment or segment to flat.
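Because the aperture size is 2^32 bytes and the base is 2^32 aligned in 64 bit
address mode, the conversions reduce to masking and OR-ing the low 32 bits. The
Python sketch below shows this. The base value is a made-up placeholder, the
function names are hypothetical, and NULL value translation between address
spaces is deliberately left out.

```python
APERTURE_SIZE = 1 << 32          # aperture size in 64 bit address mode
LOW_MASK = APERTURE_SIZE - 1     # selects the 32 bit segment address


def is_in_aperture(flat_address, aperture_base):
    """True if the flat address falls inside the aperture at aperture_base."""
    return (flat_address & ~LOW_MASK) == aperture_base


def flat_to_segment(flat_address, aperture_base):
    """Drop the aperture base, leaving the 32 bit segment address."""
    if not is_in_aperture(flat_address, aperture_base):
        raise ValueError("flat address is outside the aperture")
    return flat_address & LOW_MASK


def segment_to_flat(segment_address, aperture_base):
    """Place a 32 bit segment address inside the aperture."""
    return aperture_base | (segment_address & LOW_MASK)


# Placeholder base, aligned to 2^32; real values come from the queue or the
# inline constant registers mentioned above.
HYPOTHETICAL_SHARED_BASE = 0x0000_1000_0000_0000
flat = segment_to_flat(0x40, HYPOTHETICAL_SHARED_BASE)
print(hex(flat), hex(flat_to_segment(flat, HYPOTHETICAL_SHARED_BASE)))
```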
Image and sampler handles created by the ROCm runtime are 64 bit addresses of a
1568 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1569 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1570 enumeration values for the queries that are not trivially deducible from the S#
1576 HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1577 structure allocated in memory accessible from both the CPU and GPU. The
1578 structure is defined by the ROCm runtime and subject to change between releases
1579 (see [AMD-ROCm-github]_).
1581 .. _amdgpu-amdhsa-hsa-aql-queue:
1586 The HSA AQL queue structure is defined by the ROCm runtime and subject to change
1587 between releases (see [AMD-ROCm-github]_). For some processors it contains
1588 fields needed to implement certain language features such as the flat address
aperture bases. It also contains fields used by CP, such as those for managing the
1590 allocation of scratch memory.
1592 .. _amdgpu-amdhsa-kernel-descriptor:
1597 A kernel descriptor consists of the information needed by CP to initiate the
1598 execution of a kernel, including the entry point address of the machine code
1599 that implements the kernel.
1601 Kernel Descriptor for GFX6-GFX9
1602 +++++++++++++++++++++++++++++++
CP microcode requires the kernel descriptor to be allocated on 64 byte
1607 .. table:: Kernel Descriptor for GFX6-GFX9
1608 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1610 ======= ======= =============================== ============================
1611 Bits Size Field Name Description
1612 ======= ======= =============================== ============================
1613 31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local
1614 address space memory
1615 required for a work-group
1616 in bytes. This does not
1617 include any dynamically
1618 allocated local address
1619 space memory that may be
1620 added when the kernel is
1622 63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed
1623 private address space
1624 memory required for a
1625 work-item in bytes. If
1626 is_dynamic_callstack is 1
1627 then additional space must
1628 be added to this value for
1630 127:64 8 bytes Reserved, must be 0.
1631 191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly
1634 descriptor to kernel's
1635 entry point instruction
1636 which must be 256 byte
1638 383:192 24 Reserved, must be 0.
1640 415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS)
1641 program settings used by
1643 ``COMPUTE_PGM_RSRC1``
1646 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
1647 447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS)
1648 program settings used by
1650 ``COMPUTE_PGM_RSRC2``
1653 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
1654 448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
1655 _BUFFER SGPR user data registers
1657 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1659 The total number of SGPR
1661 requested must not exceed
1662 16 and match value in
1663 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1664 Any requests beyond 16
1666 449 1 bit ENABLE_SGPR_DISPATCH_PTR *see above*
1667 450 1 bit ENABLE_SGPR_QUEUE_PTR *see above*
1668 451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR *see above*
1669 452 1 bit ENABLE_SGPR_DISPATCH_ID *see above*
1670 453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT *see above*
1671 454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT *see above*
1673 455 1 bit Reserved, must be 0.
511:456 7 bytes Reserved, must be 0.
1675 512 **Total size 64 bytes.**
1676 ======= ====================================================================
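The field offsets in the table above can be mirrored with Python's ``struct``
module. This is a sketch rather than the layout code used by any tool: it
assumes little-endian byte order, treats bits 511:448 as one 64 bit word whose
low seven bits are the ``ENABLE_SGPR_*`` flags (bit 448 being the least
significant), and uses hypothetical function names.

```python
import struct

# group size, private size, reserved, entry offset (signed), reserved,
# compute_pgm_rsrc1, compute_pgm_rsrc2, trailing word holding bits 511:448.
KERNEL_DESCRIPTOR = struct.Struct("<IIQq24sIIQ")
assert KERNEL_DESCRIPTOR.size == 64  # matches the total size in the table


def pack_kernel_descriptor(group_size, private_size, entry_byte_offset,
                           rsrc1, rsrc2, enable_sgpr_bits):
    """Pack a 64 byte descriptor; all reserved fields are written as zero."""
    if enable_sgpr_bits & ~0x7F:
        raise ValueError("only bits 448..454 may be set; the rest is reserved")
    return KERNEL_DESCRIPTOR.pack(group_size, private_size, 0,
                                  entry_byte_offset, bytes(24),
                                  rsrc1, rsrc2, enable_sgpr_bits)


def unpack_kernel_descriptor(blob):
    """Return the non-reserved fields of a packed descriptor as a dict."""
    (group_size, private_size, _reserved0, entry_byte_offset, _reserved1,
     rsrc1, rsrc2, tail) = KERNEL_DESCRIPTOR.unpack(blob)
    return {
        "group_segment_fixed_size": group_size,
        "private_segment_fixed_size": private_size,
        "kernel_code_entry_byte_offset": entry_byte_offset,
        "compute_pgm_rsrc1": rsrc1,
        "compute_pgm_rsrc2": rsrc2,
        "enable_sgpr_bits": tail & 0x7F,
    }


# Hypothetical values; bit 3 of the flags corresponds to bit 451 in the table.
blob = pack_kernel_descriptor(16, 0, -256, 0, 0x80, 0b0001000)
print(len(blob), unpack_kernel_descriptor(blob))
```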
1680 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
1681 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
1683 ======= ======= =============================== ===========================================================================
1684 Bits Size Field Name Description
1685 ======= ======= =============================== ===========================================================================
1686 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register
1687 blocks used by each work-item;
1688 granularity is device
1693 - max(0, ceil(vgprs_used / 4) - 1)
1695 Where vgprs_used is defined
1696 as the highest VGPR number
1697 explicitly referenced plus
1700 Used by CP to set up
1701 ``COMPUTE_PGM_RSRC1.VGPRS``.
1704 :ref:`amdgpu-assembler`
1706 automatically for the
1707 selected processor from
1708 values provided to the
1709 `.amdhsa_kernel` directive
1711 `.amdhsa_next_free_vgpr`
1712 nested directive (see
1713 :ref:`amdhsa-kernel-directives-table`).
1714 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
1715 blocks used by a wavefront;
1716 granularity is device
1721 - max(0, ceil(sgprs_used / 8) - 1)
1724 - 2 * max(0, ceil(sgprs_used / 16) - 1)
1727 defined as the highest
1728 SGPR number explicitly
1729 referenced plus one, plus
1730 a target-specific number
1731 of additional special
1733 FLAT_SCRATCH (GFX7+) and
1734 XNACK_MASK (GFX8+), and
1737 limitations. It does not
1738 include the 16 SGPRs added
1739 if a trap handler is
1743 limitations and special
1744 SGPR layout are defined in
1746 documentation, which can
1748 :ref:`amdgpu-processors`
1751 Used by CP to set up
1752 ``COMPUTE_PGM_RSRC1.SGPRS``.
1755 :ref:`amdgpu-assembler`
1757 automatically for the
1758 selected processor from
1759 values provided to the
1760 `.amdhsa_kernel` directive
1762 `.amdhsa_next_free_sgpr`
1763 and `.amdhsa_reserve_*`
1764 nested directives (see
1765 :ref:`amdhsa-kernel-directives-table`).
1766 11:10 2 bits PRIORITY Must be 0.
1768 Start executing wavefront
1769 at the specified priority.
1771 CP is responsible for
1773 ``COMPUTE_PGM_RSRC1.PRIORITY``.
1774 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
1775 with specified rounding
1778 precision floating point
1781 Floating point rounding
1782 mode values are defined in
1783 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1785 Used by CP to set up
1786 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1787 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
1788 with specified rounding
mode for half/double (16
1790 and 64 bit) floating point
1791 precision floating point
1794 Floating point rounding
1795 mode values are defined in
1796 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1798 Used by CP to set up
1799 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1800 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
1801 with specified denorm mode
1804 precision floating point
1807 Floating point denorm mode
1808 values are defined in
1809 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1811 Used by CP to set up
1812 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1813 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
1814 with specified denorm mode
1816 and 64 bit) floating point
1817 precision floating point
1820 Floating point denorm mode
1821 values are defined in
1822 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1824 Used by CP to set up
1825 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
1826 20 1 bit PRIV Must be 0.
1828 Start executing wavefront
1829 in privilege trap handler
1832 CP is responsible for
1834 ``COMPUTE_PGM_RSRC1.PRIV``.
1835 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
1836 with DX10 clamp mode
1837 enabled. Used by the vector
1838 ALU to force DX10 style
treatment of NaNs (when
1840 set, clamp NaN to zero,
1844 Used by CP to set up
1845 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
1846 22 1 bit DEBUG_MODE Must be 0.
1848 Start executing wavefront
1849 in single step mode.
1851 CP is responsible for
1853 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
1854 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
1856 enabled. Floating point
1857 opcodes that support
1858 exception flag gathering
1859 will quiet and propagate
1860 signaling-NaN inputs per
1861 IEEE 754-2008. Min_dx10 and
1862 max_dx10 become IEEE
1863 754-2008 compliant due to
1864 signaling-NaN propagation
1867 Used by CP to set up
1868 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
1869 24 1 bit BULKY Must be 0.
1871 Only one work-group allowed
1872 to execute on a compute
1875 CP is responsible for
1877 ``COMPUTE_PGM_RSRC1.BULKY``.
1878 25 1 bit CDBG_USER Must be 0.
1880 Flag that can be used to
1881 control debugging code.
1883 CP is responsible for
1885 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
1886 26 1 bit FP16_OVFL GFX6-GFX8
1887 Reserved, must be 0.
1889 Wavefront starts execution
1890 with specified fp16 overflow
1893 - If 0, fp16 overflow generates
1895 - If 1, fp16 overflow that is the
1896 result of an +/-INF input value
1897 or divide by 0 produces a +/-INF,
1898 otherwise clamps computed
1899 overflow to +/-MAX_FP16 as
1902 Used by CP to set up
1903 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
1904 31:27 5 bits Reserved, must be 0.
1905 32 **Total size 4 bytes**
1906 ======= ===================================================================================================================
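The granulated count formulas in the table above translate directly into
integer arithmetic. The Python sketch below evaluates them. The function names
are hypothetical, and the ``block`` parameter simply selects between the two
SGPR formulas listed in the table; which formula applies to which processor
generation must be taken from the table itself.

```python
def ceil_div(value, divisor):
    """Integer ceiling division for non-negative values."""
    return -(-value // divisor)


def granulated_vgpr_count(vgprs_used):
    """max(0, ceil(vgprs_used / 4) - 1)"""
    return max(0, ceil_div(vgprs_used, 4) - 1)


def granulated_sgpr_count(sgprs_used, block=8):
    """Evaluate one of the two SGPR formulas from the table.

    block=8  -> max(0, ceil(sgprs_used / 8) - 1)
    block=16 -> 2 * max(0, ceil(sgprs_used / 16) - 1)
    """
    if block == 8:
        return max(0, ceil_div(sgprs_used, 8) - 1)
    if block == 16:
        return 2 * max(0, ceil_div(sgprs_used, 16) - 1)
    raise ValueError("block must be 8 or 16")


print(granulated_vgpr_count(5), granulated_sgpr_count(17), granulated_sgpr_count(17, block=16))
```

The results must fit the 6 bit and 4 bit field widths given in the table; this
sketch does not enforce those limits.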
1910 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1911 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1913 ======= ======= =============================== ===========================================================================
1914 Bits Size Field Name Description
1915 ======= ======= =============================== ===========================================================================
1916 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
1917 _WAVEFRONT_OFFSET SGPR wavefront scratch offset
1918 system register (see
1919 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1921 Used by CP to set up
1922 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
1923 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
1925 requested. This number must
1926 match the number of user
1927 data registers enabled.
1929 Used by CP to set up
1930 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
1931 6 1 bit ENABLE_TRAP_HANDLER Must be 0.
1934 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
1935 which is set by the CP if
1936 the runtime has installed a
1938 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
1939 system SGPR register for
1940 the work-group id in the X
1942 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1944 Used by CP to set up
1945 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
1946 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
1947 system SGPR register for
1948 the work-group id in the Y
1950 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1952 Used by CP to set up
1953 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
1954 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
1955 system SGPR register for
1956 the work-group id in the Z
1958 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1960 Used by CP to set up
1961 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
1962 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
1963 system SGPR register for
1964 work-group information (see
1965 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1967 Used by CP to set up
1968 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
1969 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
1970 VGPR system registers used
1971 for the work-item ID.
1972 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1975 Used by CP to set up
1976 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
1977 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
1979 Wavefront starts execution
1981 exceptions enabled which
1982 are generated when L1 has
1983 witnessed a thread access
1987 CP is responsible for
1988 filling in the address
1990 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1991 according to what the
1993 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
1995 Wavefront starts execution
1996 with memory violation
exceptions
1998 enabled which are generated
1999 when a memory violation has
2000 occurred for this wavefront from
2002 (write-to-read-only-memory,
2003 mis-aligned atomic, LDS
2004 address out of range,
2005 illegal address, etc.).
2009 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
2010 according to what the
2012 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
2014 CP uses the rounded value
2015 from the dispatch packet,
2016 not this value, as the
2017 dispatch may contain
2018 dynamically allocated group
2019 segment memory. CP writes
2021 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
2023 Amount of group segment
2024 (LDS) to allocate for each
2025 work-group. Granularity is
2029 roundup(lds-size / (64 * 4))
2031 roundup(lds-size / (128 * 4))
2033 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
2034 _INVALID_OPERATION with specified exceptions
2037 Used by CP to set up
2038 ``COMPUTE_PGM_RSRC2.EXCP_EN``
2039 (set from bits 0..6).
2043 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
2044 _SOURCE input operands is a
2046 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
2047 _DIVISION_BY_ZERO Zero
27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Overflow
2050 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
2052 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
2054 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
2055 _ZERO (rcp_iflag_f32 instruction
2057 31 1 bit Reserved, must be 0.
2058 32 **Total size 4 bytes.**
2059 ======= ===================================================================================================================
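As a rough illustration of the bit layout in the table above, the following sketch extracts a few of the listed fields from a 32 bit ``compute_pgm_rsrc2`` value. The helper names ``bits`` and ``decode_compute_pgm_rsrc2`` are hypothetical and exist only for this example; the bit positions are taken from the table.

```python
def bits(word, hi, lo):
    """Return the inclusive bit range [hi:lo] of a 32 bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_compute_pgm_rsrc2(word):
    """Decode a subset of the compute_pgm_rsrc2 fields (hypothetical helper)."""
    return {
        "USER_SGPR_COUNT": bits(word, 5, 1),
        "ENABLE_TRAP_HANDLER": bits(word, 6, 6),
        "ENABLE_SGPR_WORKGROUP_ID_X": bits(word, 7, 7),
        "ENABLE_SGPR_WORKGROUP_ID_Y": bits(word, 8, 8),
        "ENABLE_SGPR_WORKGROUP_ID_Z": bits(word, 9, 9),
        "ENABLE_SGPR_WORKGROUP_INFO": bits(word, 10, 10),
        "ENABLE_VGPR_WORKITEM_ID": bits(word, 12, 11),
        "GRANULATED_LDS_SIZE": bits(word, 23, 15),
    }


# Example value: 2 user SGPRs, work-group id X and Y enabled,
# ENABLE_VGPR_WORKITEM_ID = 1 (SYSTEM_VGPR_WORKITEM_ID_X_Y).
example = (2 << 1) | (1 << 7) | (1 << 8) | (1 << 11)
fields = decode_compute_pgm_rsrc2(example)
```

The remaining fields in the table can be decoded the same way by adding their bit ranges to the dictionary.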
2063 .. table:: Floating Point Rounding Mode Enumeration Values
2064 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
2066 ====================================== ===== ==============================
2067 Enumeration Name Value Description
2068 ====================================== ===== ==============================
2069 FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
2070 FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
2071 FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
2072 FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
2073 ====================================== ===== ==============================
2077 .. table:: Floating Point Denorm Mode Enumeration Values
2078 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
2080 ====================================== ===== ==============================
2081 Enumeration Name Value Description
2082 ====================================== ===== ==============================
2083 FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
2085 FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
2086 FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
2087 FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
2088 ====================================== ===== ==============================
2092 .. table:: System VGPR Work-Item ID Enumeration Values
2093 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
2095 ======================================== ===== ============================
2096 Enumeration Name Value Description
2097 ======================================== ===== ============================
2098 SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
2100 SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
2102 SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
2104 SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
2105 ======================================== ===== ============================
2107 .. _amdgpu-amdhsa-initial-kernel-execution-state:
2109 Initial Kernel Execution State
2110 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2112 This section defines the register state that will be set up by the packet
2113 processor prior to the start of execution of every wavefront. This is limited by
2114 the constraints of the hardware controllers of CP/ADC/SPI.
The order of the SGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
for enabled registers are dense starting at SGPR0: the first enabled register is
SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
an SGPR number.
The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
initialized. These are then immediately followed by the System SGPRs that are
set up by ADC/SPI and can have different values for each wavefront of the grid.
2130 SGPR register initial state is defined in
2131 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2133 .. table:: SGPR Register Set Up Order
2134 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2136 ========== ========================== ====== ==============================
2137 SGPR Order Name Number Description
2138 (kernel descriptor enable of
2140 ========== ========================== ====== ==============================
2141 First Private Segment Buffer 4 V# that can be used, together
2142 (enable_sgpr_private with Scratch Wavefront Offset
2143 _segment_buffer) as an offset, to access the
2144 private memory space using a
2147 CP uses the value provided by
2149 then Dispatch Ptr 2 64 bit address of AQL dispatch
2150 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
2152 then Queue Ptr 2 64 bit address of amd_queue_t
2153 (enable_sgpr_queue_ptr) object for AQL queue on which
2154 the dispatch packet was
2156 then Kernarg Segment Ptr 2 64 bit address of Kernarg
2157 (enable_sgpr_kernarg segment. This is directly
2158 _segment_ptr) copied from the
2159 kernarg_address in the kernel
2162 Having CP load it once avoids
2163 loading it at the beginning of
2165 then Dispatch Id 2 64 bit Dispatch ID of the
2166 (enable_sgpr_dispatch_id) dispatch packet being
2168 then Flat Scratch Init 2 This is 2 SGPRs:
2169 (enable_sgpr_flat_scratch
2173 The first SGPR is a 32 bit
2175 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2176 to per SPI base of memory
2177 for scratch for the queue
2178 executing the kernel
2179 dispatch. CP obtains this
2180 from the runtime. (The
2181 Scratch Segment Buffer base
2183 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2184 plus this offset.) The value
2185 of Scratch Wavefront Offset must
2186 be added to this offset by
2187 the kernel machine code,
2188 right shifted by 8, and
2189 moved to the FLAT_SCRATCH_HI
2191 FLAT_SCRATCH_HI corresponds
2192 to SGPRn-4 on GFX7, and
2193 SGPRn-6 on GFX8 (where SGPRn
2194 is the highest numbered SGPR
2195 allocated to the wavefront).
2197 multiplied by 256 (as it is
2198 in units of 256 bytes) and
2200 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2201 to calculate the per wavefront
2202 FLAT SCRATCH BASE in flat
2203 memory instructions that
2207 The second SGPR is 32 bit
2208 byte size of a single
2209 work-item's scratch memory
2210 usage. CP obtains this from
2211 the runtime, and it is
2212 always a multiple of DWORD.
2213 CP checks that the value in
2214 the kernel dispatch packet
2215 Private Segment Byte Size is
2216 not larger, and requests the
2217 runtime to increase the
2218 queue's scratch size if
2219 necessary. The kernel code
2221 FLAT_SCRATCH_LO which is
2222 SGPRn-3 on GFX7 and SGPRn-5
2223 on GFX8. FLAT_SCRATCH_LO is
2224 used as the FLAT SCRATCH
2226 instructions. Having CP load
2227 it once avoids loading it at
2228 the beginning of every
2232 64 bit base address of the
2233 per SPI scratch backing
2234 memory managed by SPI for
2235 the queue executing the
2236 kernel dispatch. CP obtains
2237 this from the runtime (and
2238 divides it if there are
2239 multiple Shader Arrays each
2240 with its own SPI). The value
2241 of Scratch Wavefront Offset must
2242 be added by the kernel
2243 machine code and the result
2244 moved to the FLAT_SCRATCH
2245 SGPR which is SGPRn-6 and
2246 SGPRn-5. It is used as the
2247 FLAT SCRATCH BASE in flat
2248 memory instructions.
2249 then Private Segment Size 1 The 32 bit byte size of a
2250 (enable_sgpr_private single
2252 scratch_segment_size) memory
2253 allocation. This is the
2254 value from the kernel
2255 dispatch packet Private
2256 Segment Byte Size rounded up
2257 by CP to a multiple of
2260 Having CP load it once avoids
2261 loading it at the beginning of
2264 This is not used for
2265 GFX7-GFX8 since it is the same
2266 value as the second SGPR of
2267 Flat Scratch Init. However, it
2268 may be needed for GFX9 which
2269 changes the meaning of the
2270 Flat Scratch Init value.
2271 then Grid Work-Group Count X 1 32 bit count of the number of
2272 (enable_sgpr_grid work-groups in the X dimension
2273 _workgroup_count_X) for the grid being
2274 executed. Computed from the
2275 fields in the kernel dispatch
2276 packet as ((grid_size.x +
2277 workgroup_size.x - 1) /
2279 then Grid Work-Group Count Y 1 32 bit count of the number of
2280 (enable_sgpr_grid work-groups in the Y dimension
2281 _workgroup_count_Y && for the grid being
2282 less than 16 previous executed. Computed from the
2283 SGPRs) fields in the kernel dispatch
2284 packet as ((grid_size.y +
2285 workgroup_size.y - 1) /
2288 Only initialized if <16
2289 previous SGPRs initialized.
2290 then Grid Work-Group Count Z 1 32 bit count of the number of
2291 (enable_sgpr_grid work-groups in the Z dimension
2292 _workgroup_count_Z && for the grid being
2293 less than 16 previous executed. Computed from the
2294 SGPRs) fields in the kernel dispatch
2295 packet as ((grid_size.z +
2296 workgroup_size.z - 1) /
2299 Only initialized if <16
2300 previous SGPRs initialized.
2301 then Work-Group Id X 1 32 bit work-group id in X
2302 (enable_sgpr_workgroup_id dimension of grid for
2304 then Work-Group Id Y 1 32 bit work-group id in Y
2305 (enable_sgpr_workgroup_id dimension of grid for
2307 then Work-Group Id Z 1 32 bit work-group id in Z
2308 (enable_sgpr_workgroup_id dimension of grid for
2310 then Work-Group Info 1 {first_wavefront, 14'b0000,
2311 (enable_sgpr_workgroup ordered_append_term[10:0],
2312 _info) threadgroup_size_in_wavefronts[5:0]}
2313 then Scratch Wavefront Offset 1 32 bit byte offset from base
2314 (enable_sgpr_private of scratch base of queue
2315 _segment_wavefront_offset) executing the kernel
2316 dispatch. Must be used as an
2318 segment address when using
2319 Scratch Segment Buffer. It
2320 must be used to set up FLAT
2321 SCRATCH for flat addressing
2323 :ref:`amdgpu-amdhsa-flat-scratch`).
2324 ========== ========================== ====== ==============================
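The dense numbering rule can be sketched as follows. This is a minimal model, assuming the order and register counts listed in the table above; the short names and the ``assign_sgprs`` helper are hypothetical, and the sketch deliberately ignores the limit of 16 initialized User SGPRs.

```python
# (name, number of SGPRs) in the set up order given by the table.
SGPR_ORDER = [
    ("private_segment_buffer", 4),
    ("dispatch_ptr", 2),
    ("queue_ptr", 2),
    ("kernarg_segment_ptr", 2),
    ("dispatch_id", 2),
    ("flat_scratch_init", 2),
    ("private_segment_size", 1),
    ("grid_workgroup_count_x", 1),
    ("grid_workgroup_count_y", 1),
    ("grid_workgroup_count_z", 1),
    ("workgroup_id_x", 1),
    ("workgroup_id_y", 1),
    ("workgroup_id_z", 1),
    ("workgroup_info", 1),
    ("private_segment_wavefront_offset", 1),
]


def assign_sgprs(enabled):
    """Map each enabled register to its (first, last) SGPR number.

    Disabled registers are skipped and receive no SGPR number, so the
    enabled ones are packed densely starting at SGPR0.
    """
    layout = {}
    next_reg = 0
    for name, count in SGPR_ORDER:
        if name in enabled:
            layout[name] = (next_reg, next_reg + count - 1)
            next_reg += count
    return layout


layout = assign_sgprs({
    "private_segment_buffer",
    "kernarg_segment_ptr",
    "workgroup_id_x",
    "private_segment_wavefront_offset",
})
```

In this example the Private Segment Buffer occupies SGPR0-SGPR3, the Kernarg Segment Ptr SGPR4-SGPR5, Work-Group Id X SGPR6, and Scratch Wavefront Offset SGPR7.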
The order of the VGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
for enabled registers are dense starting at VGPR0: the first enabled register is
VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
VGPR number.
2333 VGPR register initial state is defined in
2334 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2336 .. table:: VGPR Register Set Up Order
2337 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2339 ========== ========================== ====== ==============================
2340 VGPR Order Name Number Description
2341 (kernel descriptor enable of
2343 ========== ========================== ====== ==============================
2344 First Work-Item Id X 1 32 bit work item id in X
2345 (Always initialized) dimension of work-group for
2347 then Work-Item Id Y 1 32 bit work item id in Y
2348 (enable_vgpr_workitem_id dimension of work-group for
2349 > 0) wavefront lane.
2350 then Work-Item Id Z 1 32 bit work item id in Z
2351 (enable_vgpr_workitem_id dimension of work-group for
2352 > 1) wavefront lane.
2353 ========== ========================== ====== ==============================
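The relationship between the ``enable_vgpr_workitem_id`` value and the VGPRs that are initialized can be sketched as below. The function name and the register labels are hypothetical; the mapping follows the VGPR table above, and the value 3 is rejected because the enumeration table lists it as undefined.

```python
def workitem_id_vgprs(enable_vgpr_workitem_id):
    """Return which VGPR holds which work-item id for a given enable value."""
    if enable_vgpr_workitem_id not in (0, 1, 2):
        raise ValueError("enable_vgpr_workitem_id must be 0, 1 or 2")
    names = ["work_item_id_x", "work_item_id_y", "work_item_id_z"]
    # X is always initialized; Y needs a value > 0 and Z a value > 1.
    enabled = names[: enable_vgpr_workitem_id + 1]
    return {f"VGPR{i}": name for i, name in enumerate(enabled)}


xy_layout = workitem_id_vgprs(1)
```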
2355 The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
2357 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2359 2. Work-group Id registers X, Y, Z are set by ADC which supports any
2360 combination including none.
3. Scratch Wavefront Offset is set by SPI on a per wavefront basis, which is why
   its value cannot be included with the flat scratch init value, which is per queue.
2363 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
The Flat Scratch register pair are adjacent SGPRs so they can be moved as a 64 bit
value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2369 The global segment can be accessed either using buffer instructions (GFX6 which
2370 has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
2371 instructions (GFX9).
2373 If buffer operations are used then the compiler can generate a V# with the
2374 following properties:
2378 * ATC: 1 if IOMMU present (such as APU)
2380 * MTYPE set to support memory coherence that matches the runtime (such as CC for
2381 APU and NC for dGPU).
2383 .. _amdgpu-amdhsa-kernel-prolog:
2388 .. _amdgpu-amdhsa-m0:
The M0 register must be initialized with a value at least the total LDS size
if the kernel may access LDS via DS or flat operations. Total LDS size is
available in the dispatch packet. For M0, it is also possible to use the maximum
possible value of LDS for the given target (0x7FFF for GFX6 and 0xFFFF for
2400 The M0 register is not used for range checking LDS accesses and so does not
2401 need to be initialized in the prolog.
2403 .. _amdgpu-amdhsa-flat-scratch:
2408 If the kernel may use flat operations to access scratch memory, the prolog code
2409 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2410 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
2411 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2414 Flat scratch is not supported.
2417 1. The low word of Flat Scratch Init is 32 bit byte offset from
2418 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2419 being managed by SPI for the queue executing the kernel dispatch. This is
2420 the same value used in the Scratch Segment Buffer V# base address. The
2421 prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
2422 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
   FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right shifted
   by 8 before moving into FLAT_SCRATCH_HI.
2. The second word of Flat Scratch Init is the 32 bit byte size of a single
   work-item's scratch memory usage. This is directly loaded from the kernel
   dispatch packet Private Segment Byte Size and rounded up to a multiple of
   DWORD. Having CP load it once avoids loading it at the beginning of every
   wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
The Flat Scratch Init is the 64 bit address of the base of scratch backing
memory being managed by SPI for the queue executing the kernel dispatch. The
prolog must add the value of Scratch Wavefront Offset and move the result to the
FLAT_SCRATCH pair for use as the flat scratch base in flat memory instructions.
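The arithmetic the prolog performs can be modeled as follows. This is only a sketch of the calculations described above, not the machine code sequence: the function names are hypothetical, the 32 bit wrap-around on the GFX7-GFX8 add is an assumption, and the GFX9 variant assumes the first SGPR of Flat Scratch Init holds the low 32 bits of the address.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def flat_scratch_gfx7_gfx8(first_word, second_word, wave_offset):
    """Return (wavefront offset in 256 byte units, per work-item size in bytes)."""
    # Add the per wavefront offset to the per queue offset, then convert
    # the byte offset to units of 256 bytes by shifting right by 8.
    offset_units = ((first_word + wave_offset) & MASK32) >> 8
    # The second word is the byte size and is used unchanged.
    return offset_units, second_word


def flat_scratch_gfx9(first_word, second_word, wave_offset):
    """Return the 64 bit flat scratch base for the wavefront."""
    base = (second_word << 32) | first_word  # assumed: first word is low half
    return (base + wave_offset) & MASK64


units, size = flat_scratch_gfx7_gfx8(0x10000, 64, 0x2000)
base = flat_scratch_gfx9(0x1000, 0x1, 0x2000)
```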
2438 .. _amdgpu-amdhsa-memory-model:
2443 This section describes the mapping of LLVM memory model onto AMDGPU machine code
2444 (see :ref:`memmodel`). *The implementation is WIP.*
2447 Update when implementation complete.
2449 The AMDGPU backend supports the memory synchronization scopes specified in
2450 :ref:`amdgpu-memory-scopes`.
2452 The code sequences used to implement the memory model are defined in table
2453 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2455 The sequences specify the order of instructions that a single thread must
2456 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2457 to other memory instructions executed by the same thread. This allows them to be
2458 moved earlier or later which can allow them to be combined with other instances
2459 of the same instruction, or hoisted/sunk out of loops to improve
2460 performance. Only the instructions related to the memory model are given;
2461 additional ``s_waitcnt`` instructions are required to ensure registers are
2462 defined before being used. These may be able to be combined with the memory
2463 model ``s_waitcnt`` instructions as described above.
2465 The AMDGPU backend supports the following memory models:
2467 HSA Memory Model [HSA]_
2468 The HSA memory model uses a single happens-before relation for all address
2469 spaces (see :ref:`amdgpu-address-spaces`).
2470 OpenCL Memory Model [OpenCL]_
  The OpenCL memory model has separate happens-before relations for the
  global and local address spaces. Only a fence specifying both global and
  local address space, and seq_cst instructions join the relationships. Since
  the LLVM ``fence`` instruction does not allow an address space to be
  specified, the OpenCL fence has to conservatively assume both local and
  global address space was specified. However, optimizations can often be
  done to eliminate the additional ``s_waitcnt`` instructions when there are
  no intervening memory instructions which access the corresponding address
  space. The code sequences in the table indicate what can be omitted for the
  OpenCL memory model. The target triple environment is used to determine if the
  source language is OpenCL (see :ref:`amdgpu-opencl`).
2483 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2486 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
2487 termed vector memory operations.
2491 * Each agent has multiple compute units (CU).
2492 * Each CU has multiple SIMDs that execute wavefronts.
2493 * The wavefronts for a single work-group are executed in the same CU but may be
2494 executed by different SIMDs.
2495 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
2497 * All LDS operations of a CU are performed as wavefront wide operations in a
2498 global order and involve no caching. Completion is reported to a wavefront in
2500 * The LDS memory has multiple request queues shared by the SIMDs of a
2501 CU. Therefore, the LDS operations performed by different wavefronts of a work-group
2502 can be reordered relative to each other, which can result in reordering the
2503 visibility of vector memory operations with respect to LDS operations of other
2504 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
2505 ensure synchronization between LDS operations and vector memory operations
2506 between wavefronts of a work-group, but not between operations performed by the
2508 * The vector memory operations are performed as wavefront wide operations and
2509 completion is reported to a wavefront in execution order. The exception is
2510 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
2511 vector memory order if they access LDS memory, and out of LDS operation order
2512 if they access global memory.
2513 * The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between the
2515 lanes of a single wavefront, or for coherence between wavefronts in the same
2516 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
2517 executing in different work-groups as they may be executing on different CUs.
2518 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
2519 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2520 scalar operations are used in a restricted way so do not impact the memory
2521 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2522 * The vector and scalar memory operations use an L2 cache shared by all CUs on
2524 * The L2 cache has independent channels to service disjoint ranges of virtual
2526 * Each CU has a separate request queue per channel. Therefore, the vector and
2527 scalar memory operations performed by wavefronts executing in different work-groups
2528 (which may be executing on different CUs) of an agent can be reordered
2529 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
2530 synchronization between vector memory operations of different CUs. It ensures a
2531 previous vector memory operation has completed before executing a subsequent
2532 vector memory or LDS operation and so can be used to meet the requirements of
2533 acquire and release.
2534 * The L2 cache can be kept coherent with other agents on some targets, or ranges
2535 of virtual addresses can be set up to bypass it to ensure system coherence.
2537 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
2538 or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2539 memory, atomic memory orderings are not meaningful and all accesses are treated
2542 Constant address space uses ``buffer/global_load`` instructions (or equivalent
2543 scalar memory instructions). Since the constant address space contents do not
2544 change during the execution of a kernel dispatch it is not legal to perform
stores, and atomic memory orderings are not meaningful and all accesses are
2546 treated as non-atomic.
2548 A memory synchronization scope wider than work-group is not meaningful for the
2549 group (LDS) address space and is treated as work-group.
2551 The memory model does not support the region address space which is treated as
2554 Acquire memory ordering is not meaningful on store atomic instructions and is
2555 treated as non-atomic.
2557 Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.
2560 Acquire-release memory ordering is not meaningful on load or store atomic
2561 instructions and is treated as acquire and release respectively.
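The ordering and scope adjustments described in the preceding paragraphs can be summarized in a small sketch. The function name and the string labels for instructions, orderings, scopes, and address spaces are hypothetical and chosen only for this illustration; the backend's actual implementation may be structured differently.

```python
def normalize_atomic(instr, ordering, scope, address_space):
    """Apply the 'not meaningful' rules and return (ordering, scope)."""
    # A scope wider than work-group is treated as work-group for LDS.
    if address_space == "local" and scope in ("agent", "system"):
        scope = "workgroup"

    if instr == "store" and ordering == "acquire":
        ordering = "non-atomic"
    elif instr == "load" and ordering == "release":
        ordering = "non-atomic"
    elif ordering == "acq_rel":
        # acq_rel degrades to acquire on loads and release on stores.
        if instr == "load":
            ordering = "acquire"
        elif instr == "store":
            ordering = "release"
    return ordering, scope


result = normalize_atomic("load", "acq_rel", "system", "local")
```

An ``atomicrmw`` with ``acq_rel`` ordering is left unchanged by this sketch, since both halves of the ordering are meaningful for a read-modify-write.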
2563 AMDGPU backend only uses scalar memory operations to access memory that is
2564 proven to not change during the execution of the kernel dispatch. This includes
2565 constant address space and global address space for program scope const
2566 variables. Therefore the kernel machine code does not have to maintain the
2567 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2568 and vector L1 caches are invalidated between kernel dispatches by CP since
2569 constant address space data may change between kernel dispatch executions. See
2570 :ref:`amdgpu-amdhsa-memory-spaces`.
The one exception is if scalar writes are used to spill SGPR registers. In this
2573 case the AMDGPU backend ensures the memory location used to spill is never
2574 accessed by vector memory operations at the same time. If scalar writes are used
2575 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2576 return since the locations may be used for vector memory instructions by a
2577 future wavefront that uses the same scratch area, or a function call that creates a
2578 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2579 as all scalar writes are write-before-read in the same thread.
2581 Scratch backing memory (which is used for the private address space)
is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
2583 address space is only accessed by a single thread, and is always
2584 write-before-read, there is never a need to invalidate these entries from the L1
2585 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2586 volatile cache lines.
2588 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
2589 to invalidate the L2 cache. This also causes it to be treated as
2590 non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
(cache coherent) and so the L2 cache will be coherent with the CPU and other
2594 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2595 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2597 ============ ============ ============== ========== ===============================
2598 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2599 Ordering Sync Scope Address
2601 ============ ============ ============== ========== ===============================
2603 -----------------------------------------------------------------------------------
2604 load *none* *none* - global - !volatile & !nontemporal
2606 - private 1. buffer/global/flat_load
2608 - volatile & !nontemporal
2610 1. buffer/global/flat_load
2615 1. buffer/global/flat_load
2618 load *none* *none* - local 1. ds_load
2619 store *none* *none* - global - !nontemporal
2621 - private 1. buffer/global/flat_store
1. buffer/global/flat_store
2628 store *none* *none* - local 1. ds_store
2629 **Unordered Atomic**
2630 -----------------------------------------------------------------------------------
2631 load atomic unordered *any* *any* *Same as non-atomic*.
2632 store atomic unordered *any* *any* *Same as non-atomic*.
2633 atomicrmw unordered *any* *any* *Same as monotonic
2635 **Monotonic Atomic**
2636 -----------------------------------------------------------------------------------
2637 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2638 - wavefront - generic
2640 load atomic monotonic - singlethread - local 1. ds_load
2643 load atomic monotonic - agent - global 1. buffer/global/flat_load
2644 - system - generic glc=1
2645 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2646 - wavefront - generic
2650 store atomic monotonic - singlethread - local 1. ds_store
2653 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2654 - wavefront - generic
2658 atomicrmw monotonic - singlethread - local 1. ds_atomic
2662 -----------------------------------------------------------------------------------
2663 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2666 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2667 load atomic acquire - workgroup - local 1. ds_load
2668 2. s_waitcnt lgkmcnt(0)
2671 - Must happen before
2683 load atomic acquire - workgroup - generic 1. flat_load
2684 2. s_waitcnt lgkmcnt(0)
2687 - Must happen before
2699 load atomic acquire - agent - global 1. buffer/global/flat_load
2701 2. s_waitcnt vmcnt(0)
2703 - Must happen before
2711 3. buffer_wbinvl1_vol
2713 - Must happen before
2723 load atomic acquire - agent - generic 1. flat_load glc=1
2724 - system 2. s_waitcnt vmcnt(0) &
2729 - Must happen before
2732 - Ensures the flat_load
2737 3. buffer_wbinvl1_vol
2739 - Must happen before
2749 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2752 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2753 atomicrmw acquire - workgroup - local 1. ds_atomic
2. s_waitcnt lgkmcnt(0)
2757 - Must happen before
2770 atomicrmw acquire - workgroup - generic 1. flat_atomic
2. s_waitcnt lgkmcnt(0)
2774 - Must happen before
2787 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
2788 - system 2. s_waitcnt vmcnt(0)
2790 - Must happen before
2799 3. buffer_wbinvl1_vol
2801 - Must happen before
2811 atomicrmw acquire - agent - generic 1. flat_atomic
2812 - system 2. s_waitcnt vmcnt(0) &
2817 - Must happen before
2826 3. buffer_wbinvl1_vol
2828 - Must happen before
2838 fence acquire - singlethread *none* *none*
2840 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2845 - However, since LLVM
2870 fence-paired-atomic).
2871 - Must happen before
2882 fence-paired-atomic.
2884 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2891 - However, since LLVM
2899 - Could be split into
2908 - s_waitcnt vmcnt(0)
2919 fence-paired-atomic).
2920 - s_waitcnt lgkmcnt(0)
2931 fence-paired-atomic).
2932 - Must happen before
2946 fence-paired-atomic.
2948 2. buffer_wbinvl1_vol
2950 - Must happen before any
2951 following global/generic
2961 -----------------------------------------------------------------------------------
2962 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2965 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
2974 - Must happen before
2985 2. buffer/global/flat_store
2986 store atomic release - workgroup - local 1. ds_store
2987 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2996 - Must happen before
3008 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
3009 - system - generic vmcnt(0)
3013 - Could be split into
3022 - s_waitcnt vmcnt(0)
3029 - s_waitcnt lgkmcnt(0)
3036 - Must happen before
3047 2. buffer/global/ds/flat_store
3048 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
3051 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
3060 - Must happen before
3071 2. buffer/global/flat_atomic
3072 atomicrmw release - workgroup - local 1. ds_atomic
3073 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3082 - Must happen before
3094 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
3095 - system - generic vmcnt(0)
3099 - Could be split into
3108 - s_waitcnt vmcnt(0)
3115 - s_waitcnt lgkmcnt(0)
3122 - Must happen before
3133 2. buffer/global/ds/flat_atomic
3134 fence release - singlethread *none* *none*
3136 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3141 - However, since LLVM
3162 - Must happen before
3171 fence-paired-atomic).
3178 fence-paired-atomic.
3180 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
3191 - However, since LLVM
3206 - Could be split into
3215 - s_waitcnt vmcnt(0)
3222 - s_waitcnt lgkmcnt(0)
3229 - Must happen before
3238 fence-paired-atomic).
3245 fence-paired-atomic.
3247 **Acquire-Release Atomic**
3248 -----------------------------------------------------------------------------------
3249 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3252 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3261 - Must happen before
3272 2. buffer/global/flat_atomic
3273 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3274 2. s_waitcnt lgkmcnt(0)
3277 - Must happen before
3290 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3299 - Must happen before
3311 3. s_waitcnt lgkmcnt(0)
3314 - Must happen before
3327 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3332 - Could be split into
3341 - s_waitcnt vmcnt(0)
3348 - s_waitcnt lgkmcnt(0)
3355 - Must happen before
3366 2. buffer/global/flat_atomic
3367 3. s_waitcnt vmcnt(0)
3369 - Must happen before
3378 4. buffer_wbinvl1_vol
3380 - Must happen before
3390 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3395 - Could be split into
3404 - s_waitcnt vmcnt(0)
3411 - s_waitcnt lgkmcnt(0)
3418 - Must happen before
3430 3. s_waitcnt vmcnt(0) &
3435 - Must happen before
3444 4. buffer_wbinvl1_vol
3446 - Must happen before
3456 fence acq_rel - singlethread *none* *none*
3458 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3478 - Must happen before
3501 acquire-fence-paired-atomic
3522 release-fence-paired-atomic
3523 ). This satisfies the
3527 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3534 - However, since LLVM
3542 - Could be split into
3551 - s_waitcnt vmcnt(0)
3558 - s_waitcnt lgkmcnt(0)
3565 - Must happen before
3570 global/local/generic
3579 acquire-fence-paired-atomic
3591 global/local/generic
3600 release-fence-paired-atomic
3601 ). This satisfies the
3605 2. buffer_wbinvl1_vol
3607 - Must happen before
3621 **Sequentially Consistent Atomic**
3622 -----------------------------------------------------------------------------------
3623 load atomic seq_cst - singlethread - global *Same as corresponding
3624 - wavefront - local load atomic acquire,
3625 - generic except must generate
3626 all instructions even
3628 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3643 lgkmcnt(0) and so do
3678 instructions same as
3681 except must generate
3682 all instructions even
3684 load atomic seq_cst - workgroup - local *Same as corresponding
3685 load atomic acquire,
3686 except must generate
3687 all instructions even
3689 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3690 - system - generic vmcnt(0)
3692 - Could be split into
3701 - s_waitcnt lgkmcnt(0)
3714 lgkmcnt(0) and so do
3765 instructions same as
3768 except must generate
3769 all instructions even
3771 store atomic seq_cst - singlethread - global *Same as corresponding
3772 - wavefront - local store atomic release,
3773 - workgroup - generic except must generate
3774 all instructions even
3776 store atomic seq_cst - agent - global *Same as corresponding
3777 - system - generic store atomic release,
3778 except must generate
3779 all instructions even
3781 atomicrmw seq_cst - singlethread - global *Same as corresponding
3782 - wavefront - local atomicrmw acq_rel,
3783 - workgroup - generic except must generate
3784 all instructions even
3786 atomicrmw seq_cst - agent - global *Same as corresponding
3787 - system - generic atomicrmw acq_rel,
3788 except must generate
3789 all instructions even
3791 fence seq_cst - singlethread *none* *Same as corresponding
3792 - wavefront fence acq_rel,
3793 - workgroup except must generate
3794 - agent all instructions even
3795 - system for OpenCL.*
3796 ============ ============ ============== ========== ===============================
3798 The memory order also adds the single thread optimization constraints defined in
3800 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3802 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3803 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3805 ============ ==============================================================
3806 LLVM Memory Optimization Constraints
3808 ============ ==============================================================
3811 acquire - If a load atomic/atomicrmw then no following load/load
3812 atomic/store/store atomic/atomicrmw/fence instruction can
3813 be moved before the acquire.
3814 - If a fence then same as load atomic, plus no preceding
3815 associated fence-paired-atomic can be moved after the fence.
3816 release - If a store atomic/atomicrmw then no preceding load/load
3817 atomic/store/store atomic/atomicrmw/fence instruction can
3818 be moved after the release.
3819 - If a fence then same as store atomic, plus no following
3820 associated fence-paired-atomic can be moved before the
3822 acq_rel Same constraints as both acquire and release.
3823 seq_cst - If a load atomic then same constraints as acquire, plus no
3824 preceding sequentially consistent load atomic/store
3825 atomic/atomicrmw/fence instruction can be moved after the
3827 - If a store atomic then the same constraints as release, plus
3828 no following sequentially consistent load atomic/store
3829 atomic/atomicrmw/fence instruction can be moved before the
3831 - If an atomicrmw/fence then same constraints as acq_rel.
3832 ============ ==============================================================
3837 For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible runtimes
3838 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3839 the ``s_trap`` instruction with the following usage:
3841 .. table:: AMDGPU Trap Handler for AMDHSA OS
3842 :name: amdgpu-trap-handler-for-amdhsa-os-table
3844 =================== =============== =============== =======================
3845 Usage Code Sequence Trap Handler Description
3847 =================== =============== =============== =======================
3848 reserved ``s_trap 0x00`` Reserved by hardware.
3849 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3850 ``queue_ptr`` ``debugtrap``
3851 ``VGPR0``: intrinsic (not
3852 ``arg`` implemented).
3853 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3854 ``queue_ptr`` terminated and its
3855 associated queue put
3856 into the error state.
3857 ``llvm.debugtrap`` ``s_trap 0x03`` - If debugger not
3867 - If the debugger is
3869 the debug trap to be
3873 the halt state until
3876 reserved ``s_trap 0x04`` Reserved.
3877 reserved ``s_trap 0x05`` Reserved.
3878 reserved ``s_trap 0x06`` Reserved.
3879 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3881 reserved ``s_trap 0x08`` Reserved.
3882 reserved ``s_trap 0xfe`` Reserved.
3883 reserved ``s_trap 0xff`` Reserved.
3884 =================== =============== =============== =======================
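As a rough illustration of how the ``s_trap`` IDs in the table above might be looked up by tooling, the following sketch maps each ID to its usage column. The helper name ``describe_trap`` and the dictionary are hypothetical and are not part of LLVM or of any HSA runtime; IDs that the table does not name are treated as reserved.

```python
# Usage names are copied from the AMDHSA trap handler table.
# `TRAP_USAGE` and `describe_trap` are hypothetical names for illustration only.
TRAP_USAGE = {
    0x00: "reserved (by hardware)",
    0x01: "debugtrap(arg)",
    0x02: "llvm.trap",
    0x03: "llvm.debugtrap",
    0x07: "debugger breakpoint",
}


def describe_trap(trap_id: int) -> str:
    """Return the usage for an s_trap ID in the range listed by the table."""
    if not 0x00 <= trap_id <= 0xFF:
        raise ValueError("the table only lists s_trap IDs 0x00 through 0xff")
    # Every ID not named explicitly in the table is shown as reserved.
    return TRAP_USAGE.get(trap_id, "reserved")
```

For example, ``describe_trap(0x02)`` yields the ``llvm.trap`` usage, while ``describe_trap(0x04)`` falls through to the reserved case.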
3889 This section provides code conventions used when the target triple OS is
3890 ``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
3891 from the application/runtime to each invocation of a hardware shader. These
3892 parameters include both generic, application-controlled parameters called
3893 *user data* as well as system-generated parameters that are a product of the
3894 draw or dispatch execution.
3899 Each hardware stage has a set of 32-bit *user data registers* which can be
3900 written from a command buffer and then loaded into SGPRs when waves are launched
3901 via a subsequent dispatch or draw operation. This is the way most arguments are
3902 passed from the application/runtime to a hardware shader.
3907 Compute shader user data mappings are simpler than graphics shaders, and have a
3910 Note that there are always 10 available *user data entries* in registers -
3911 entries beyond that limit must be fetched from memory (via the spill table
3912 pointer) by the shader.
3914 .. table:: PAL Compute Shader User Data Registers
3915 :name: pal-compute-user-data-registers
3917 ============= ================================
3918 User Register Description
3919 ============= ================================
3920 0 Global Internal Table (32-bit pointer)
3921 1 Per-Shader Internal Table (32-bit pointer)
3922 2 - 11 Application-Controlled User Data (10 32-bit values)
3923 12 Spill Table (32-bit pointer)
3924 13 - 14 Thread Group Count (64-bit pointer)
3926 ============= ================================
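The fixed compute mapping above can be sketched as a small lookup: the first 10 application-controlled entries land in user registers 2 to 11, and later entries must be read through the spill table pointer in register 12. The function name and return shape are hypothetical; the layout of entries inside the spill table is not described here, so the sketch only reports which register holds the pointer.

```python
FIRST_APP_REGISTER = 2      # first application-controlled user register
ENTRIES_IN_REGISTERS = 10   # entries 0-9 are held in user registers 2-11
SPILL_TABLE_REGISTER = 12   # holds the 32-bit spill table pointer


def locate_compute_entry(entry: int):
    """Say where a compute shader finds a given user data entry.

    Returns ("register", n) for entries held directly in user register n,
    or ("spill", 12) when the entry must be fetched from memory through
    the spill table pointer. Hypothetical helper, for illustration only.
    """
    if entry < 0:
        raise ValueError("user data entry index must be non-negative")
    if entry < ENTRIES_IN_REGISTERS:
        return ("register", FIRST_APP_REGISTER + entry)
    return ("spill", SPILL_TABLE_REGISTER)
```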
3931 Graphics pipelines support a much more flexible user data mapping:
3933 .. table:: PAL Graphics Shader User Data Registers
3934 :name: pal-graphics-user-data-registers
3936 ============= ================================
3937 User Register Description
3938 ============= ================================
3939 0 Global Internal Table (32-bit pointer)
3940 + Per-Shader Internal Table (32-bit pointer)
3941 + 1-15 Application Controlled User Data
3942 (1-15 Contiguous 32-bit Values in Registers)
3943 + Spill Table (32-bit pointer)
3944 + Draw Index (First Stage Only)
3945 + Vertex Offset (First Stage Only)
3946 + Instance Offset (First Stage Only)
3947 ============= ================================
3949 The placement of the global internal table remains fixed in the first *user
3950 data SGPR register*. Otherwise all parameters are optional, and can be mapped
3951 to any desired *user data SGPR register*, with the following restrictions:
3953 * Draw Index, Vertex Offset, and Instance Offset can only be used by the first
3954 active hardware stage in a graphics pipeline (i.e. where the API vertex
3957 * Application-controlled user data must be mapped into a contiguous range of
3958 user data registers.
3960 * The application-controlled user data range supports compaction remapping, so
3961 only *entries* that are actually consumed by the shader must be assigned to
3962 corresponding *registers*. Note that in order to support an efficient runtime
3963 implementation, the remapping must pack *registers* in the same order as
3964 *entries*, with unused *entries* removed.
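A minimal sketch of the compaction remapping rule, assuming the consumed entries are known as a set of indices and the first register of the contiguous range has already been chosen. The function name ``compact_user_data`` and its parameters are hypothetical.

```python
def compact_user_data(used_entries, first_register):
    """Map consumed user data entries to a contiguous register range.

    Unused entries are dropped and the remaining entries keep their
    relative order, as the remapping restriction requires.
    Hypothetical helper, for illustration only.
    """
    mapping = {}
    for slot, entry in enumerate(sorted(set(used_entries))):
        mapping[entry] = first_register + slot
    return mapping
```

With entries 0, 3 and 5 consumed and the range starting at register 2, the sketch assigns registers 2, 3 and 4 in that order.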
3966 .. _pal_global_internal_table:
3968 Global Internal Table
3969 ~~~~~~~~~~~~~~~~~~~~~
3971 The global internal table is a table of *shader resource descriptors* (SRDs) that
3972 define how certain engine-wide, runtime-managed resources should be accessed
3973 from a shader. The majority of these resources have HW-defined formats, and it
3974 is up to the compiler to write/read data as required by the target hardware.
3976 The following table illustrates the required format:
3978 .. table:: PAL Global Internal Table
3979 :name: pal-git-table
3981 ============= ================================
3983 ============= ================================
3984 0-3 Graphics Scratch SRD
3985 4-7 Compute Scratch SRD
3986 8-11 ES/GS Ring Output SRD
3987 12-15 ES/GS Ring Input SRD
3988 16-19 GS/VS Ring Output #0
3989 20-23 GS/VS Ring Output #1
3990 24-27 GS/VS Ring Output #2
3991 28-31 GS/VS Ring Output #3
3992 32-35 GS/VS Ring Input SRD
3993 36-39 Tessellation Factor Buffer SRD
3994 40-43 Off-Chip LDS Buffer SRD
3995 44-47 Off-Chip Param Cache Buffer SRD
3996 48-51 Sample Position Buffer SRD
3997 52 vaRange::ShadowDescriptorTable High Bits
3998 ============= ================================
4000 The pointer to the global internal table passed to the shader as user data
4001 is a 32-bit pointer. The top 32 bits should be assumed to be the same as
4002 the top 32 bits of the pipeline, so the shader may use the program
4003 counter's top 32 bits.
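The pointer extension described above can be sketched as plain integer arithmetic: keep the low 32 bits from the user data value and borrow the high 32 bits from the program counter. The function and parameter names are hypothetical; a real shader would do this with SGPR moves rather than Python.

```python
LOW_32_BITS = 0xFFFFFFFF
HIGH_32_BITS = 0xFFFFFFFF00000000


def extend_git_pointer(git_ptr32: int, program_counter: int) -> int:
    """Build a 64-bit address for the global internal table.

    Assumes, as the text states, that the table shares its top 32 bits
    with the pipeline, so the program counter's top half can be reused.
    """
    return (program_counter & HIGH_32_BITS) | (git_ptr32 & LOW_32_BITS)
```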
4008 This section provides code conventions used when the target triple OS is
4009 empty (see :ref:`amdgpu-target-triples`).
4014 For code objects generated by the AMDGPU backend for a non-amdhsa OS, the runtime does
4015 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
4016 instructions are handled as follows:
4018 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
4019 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
4021 =============== =============== ===========================================
4022 Usage Code Sequence Description
4023 =============== =============== ===========================================
4024 llvm.trap s_endpgm Causes wavefront to be terminated.
4025 llvm.debugtrap *none* Compiler warning given that there is no
4026 trap handler installed.
4027 =============== =============== ===========================================
4037 When the language is OpenCL the following differences occur:
4039 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
4040 2. The AMDGPU backend appends additional arguments to the kernel's explicit
4041 arguments for the AMDHSA OS (see
4042 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
4043 3. Additional metadata is generated
4044 (see :ref:`amdgpu-amdhsa-code-object-metadata`).
4046 .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
4047 :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
4049 ======== ==== ========= ===========================================
4050 Position Byte Byte Description
4052 ======== ==== ========= ===========================================
4053 1 8 8 OpenCL Global Offset X
4054 2 8 8 OpenCL Global Offset Y
4055 3 8 8 OpenCL Global Offset Z
4056 4 8 8 OpenCL address of printf buffer
4057 5 8 8 OpenCL address of virtual queue used by
4059 6 8 8 OpenCL address of AqlWrap struct used by
4061 ======== ==== ========= ===========================================
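To make the table concrete, the sketch below computes byte offsets for the six implicit arguments, each 8 bytes in size with 8-byte alignment. It assumes the implicit arguments are packed directly after the explicit arguments at the next suitably aligned offset; that packing rule, the short argument names, and the function names are assumptions made for illustration.

```python
# (name, byte size, byte alignment) in table order; names are hypothetical.
IMPLICIT_ARGS = [
    ("global_offset_x", 8, 8),
    ("global_offset_y", 8, 8),
    ("global_offset_z", 8, 8),
    ("printf_buffer", 8, 8),
    ("virtual_queue", 8, 8),
    ("aqlwrap", 8, 8),
]


def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def implicit_arg_offsets(explicit_args_size: int) -> dict:
    """Return the assumed byte offset of each appended implicit argument."""
    offsets = {}
    offset = explicit_args_size
    for name, size, alignment in IMPLICIT_ARGS:
        offset = align_up(offset, alignment)
        offsets[name] = offset
        offset += size
    return offsets
```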
4068 When the language is HCC the following differences occur:
4070 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
4072 .. _amdgpu-assembler:
4077 The AMDGPU backend has an LLVM-MC based assembler, which is currently in development.
4078 It supports AMDGCN GFX6-GFX9.
4080 This section describes general syntax for instructions and operands.
4093 An instruction has the following syntax:
4095 *<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...*
4097 Note that operands are normally comma-separated while modifiers are space-separated.
4099 The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
4101 See detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
4102 :doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.
4104 Note that features under development are not included in this description.
4106 For more information about instructions, their semantics and supported combinations of
4107 operands, refer to one of the instruction set architecture manuals
4108 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
4113 The following syntax for register operands is supported:
4115 * SGPR registers: s0, ... or s[0], ...
4116 * VGPR registers: v0, ... or v[0], ...
4117 * TTMP registers: ttmp0, ... or ttmp[0], ...
4118 * Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
4119 * Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
4120 * Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
4121 * Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
4122 * Register index expressions: v[2*2], s[1-1:2-1]
4123 * 'off' indicates that an operand is not enabled
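The simpler register forms listed above can be recognized with a short regular expression. This is only a sketch: it covers single registers and ranges for ``s``, ``v`` and ``ttmp``, and deliberately leaves out special registers, register lists, and index expressions. The function name ``parse_register`` is hypothetical and unrelated to the actual LLVM-MC parser.

```python
import re

# Matches s0, s[0], v[10:11], ttmp[4:7] and similar forms.
_REGISTER = re.compile(r"^(s|v|ttmp)(?:(\d+)|\[(\d+)(?::(\d+))?\])$")


def parse_register(text: str):
    """Return (kind, first, last) for a simple register operand."""
    match = _REGISTER.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported register operand: {text!r}")
    kind = match.group(1)
    if match.group(2) is not None:
        # Plain form such as s0 or ttmp3.
        first = last = int(match.group(2))
    else:
        # Bracketed form such as v[0] or s[2:3].
        first = int(match.group(3))
        last = int(match.group(4)) if match.group(4) is not None else first
    if last < first:
        raise ValueError(f"register range is reversed: {text!r}")
    return kind, first, last
```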
4128 A detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.
4130 Instruction Examples
4131 ~~~~~~~~~~~~~~~~~~~~
4136 .. code-block:: nasm
4138 ds_add_u32 v2, v4 offset:16
4139 ds_write_src2_b64 v2 offset0:4 offset1:8
4140 ds_cmpst_f32 v2, v4, v6
4141 ds_min_rtn_f64 v[8:9], v2, v[4:5]
4144 For the full list of supported instructions, refer to "LDS/GDS instructions" in the ISA manual.
4149 .. code-block:: nasm
4151 flat_load_dword v1, v[3:4]
4152 flat_store_dwordx3 v[3:4], v[5:7]
4153 flat_atomic_swap v1, v[3:4], v5 glc
4154 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
4155 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
4157 For the full list of supported instructions, refer to "FLAT instructions" in the ISA manual.
4162 .. code-block:: nasm
4164 buffer_load_dword v1, off, s[4:7], s1
4165 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
4166 buffer_store_format_xy v[1:2], off, s[4:7], s1
4168 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
4170 For the full list of supported instructions, refer to "MUBUF Instructions" in the ISA manual.
4175 .. code-block:: nasm
4177 s_load_dword s1, s[2:3], 0xfc
4178 s_load_dwordx8 s[8:15], s[2:3], s4
4179 s_load_dwordx16 s[88:103], s[2:3], s4
4183 For the full list of supported instructions, refer to "Scalar Memory Operations" in the ISA manual.
4188 .. code-block:: nasm
4191 s_mov_b64 s[0:1], 0x80000000
4193 s_wqm_b64 s[2:3], s[4:5]
4194 s_bcnt0_i32_b64 s1, s[2:3]
4195 s_swappc_b64 s[2:3], s[4:5]
4196 s_cbranch_join s[4:5]
4198 For the full list of supported instructions, refer to "SOP1 Instructions" in the ISA manual.
4203 .. code-block:: nasm
4205 s_add_u32 s1, s2, s3
4206 s_and_b64 s[2:3], s[4:5], s[6:7]
4207 s_cselect_b32 s1, s2, s3
4208 s_andn2_b32 s2, s4, s6
4209 s_lshr_b64 s[2:3], s[4:5], s6
4210 s_ashr_i32 s2, s4, s6
4211 s_bfm_b64 s[2:3], s4, s6
4212 s_bfe_i64 s[2:3], s[4:5], s6
4213 s_cbranch_g_fork s[4:5], s[6:7]
4215 For the full list of supported instructions, refer to "SOP2 Instructions" in the ISA manual.
4220 .. code-block:: nasm
4223 s_bitcmp1_b32 s1, s2
4224 s_bitcmp0_b64 s[2:3], s4
4227 For the full list of supported instructions, refer to "SOPC Instructions" in the ISA manual.
4232 .. code-block:: nasm
4237 s_waitcnt 0 ; Wait for all counters to be 0
4238 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
4239 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
4243 s_sendmsg sendmsg(MSG_INTERRUPT)
4246 For the full list of supported instructions, refer to "SOPP Instructions" in the ISA manual.
4248 Unless otherwise mentioned, little verification is performed on the operands
4249 of SOPP Instructions, so it is up to the programmer to be familiar with the
4250 range of acceptable values.
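Because little verification is performed on SOPP operands, it can help to see how the ``s_waitcnt`` counter syntax relates to the raw immediate. The sketch below assumes the GFX6-GFX8 style field layout (``vmcnt`` in bits 3:0, ``expcnt`` in bits 6:4, ``lgkmcnt`` in bits 11:8); this layout is an assumption taken from the ISA manuals rather than from this document, it differs on other generations, and the function name is hypothetical. A counter that is not named is left at its maximum value, meaning no wait on that counter.

```python
# Assumed field widths for GFX6-GFX8; check the ISA manual for the target.
VMCNT_MAX = 0xF    # 4 bits at bit 0
EXPCNT_MAX = 0x7   # 3 bits at bit 4
LGKMCNT_MAX = 0xF  # 4 bits at bit 8


def encode_waitcnt(vmcnt=VMCNT_MAX, expcnt=EXPCNT_MAX, lgkmcnt=LGKMCNT_MAX):
    """Pack the three counters into an s_waitcnt immediate (assumed layout)."""
    fields = (
        ("vmcnt", vmcnt, VMCNT_MAX),
        ("expcnt", expcnt, EXPCNT_MAX),
        ("lgkmcnt", lgkmcnt, LGKMCNT_MAX),
    )
    for name, value, limit in fields:
        if not 0 <= value <= limit:
            raise ValueError(f"{name} out of range: {value}")
    return vmcnt | (expcnt << 4) | (lgkmcnt << 8)
```

Under this assumed layout, setting all three counters to zero produces the immediate 0, which matches the equivalence between ``s_waitcnt 0`` and the fully spelled-out form shown in the examples.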
4255 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
4256 the assembler will automatically use the optimal encoding based on its operands.
4257 To force a specific encoding, one can add a suffix to the opcode of the instruction:
4259 * _e32 for 32-bit VOP1/VOP2/VOPC
4260 * _e64 for 64-bit VOP3
4262 * _sdwa for VOP_SDWA
4264 VOP1/VOP2/VOP3/VOPC examples:
4266 .. code-block:: nasm
4269 v_mov_b32_e32 v1, v2
4271 v_cvt_f64_i32_e32 v[1:2], v2
4272 v_floor_f32_e32 v1, v2
4273 v_bfrev_b32_e32 v1, v2
4274 v_add_f32_e32 v1, v2, v3
4275 v_mul_i32_i24_e64 v1, v2, 3
4276 v_mul_i32_i24_e32 v1, -3, v3
4277 v_mul_i32_i24_e32 v1, -100, v3
4278 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
4279 v_max_f16_e32 v1, v2, v3
4283 .. code-block:: nasm
4285 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
4286 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4287 v_mov_b32 v0, v0 wave_shl:1
4288 v_mov_b32 v0, v0 row_mirror
4289 v_mov_b32 v0, v0 row_bcast:31
4290 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
4291 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4292 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4296 .. code-block:: nasm
4298 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4299 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4300 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4301 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4302 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4304 For the full list of supported instructions, refer to "Vector ALU instructions" in the ISA manual.
4307 Remove once we switch to code object v3 by default.
4309 HSA Code Object Directives
4310 ~~~~~~~~~~~~~~~~~~~~~~~~~~
4312 The AMDGPU ABI defines auxiliary data in the output code object. In assembly source,
4313 this data can be specified with assembler directives.
4315 .hsa_code_object_version major, minor
4316 +++++++++++++++++++++++++++++++++++++
4318 *major* and *minor* are integers that specify the version of the HSA code
4319 object that will be generated by the assembler.
4321 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
4322 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4325 *major*, *minor*, and *stepping* are all integers that describe the instruction
4326 set architecture (ISA) version of the assembly program.
4328 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
4329 "AMD" and *arch* should always be equal to "AMDGPU".
4331 By default, the assembler will derive the ISA version, *vendor*, and *arch*
4332 from the value of the -mcpu option that is passed to the assembler.
4334 .amdgpu_hsa_kernel (name)
4335 +++++++++++++++++++++++++
4337 This directive specifies that the symbol with the given name is a kernel entry point
4338 (label) and that the object should contain a corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
4343 This directive marks the beginning of a list of key / value pairs that are used
4344 to specify the amd_kernel_code_t object that will be emitted by the assembler.
4345 The list must be terminated by the *.end_amd_kernel_code_t* directive. For
4346 any amd_kernel_code_t values that are unspecified a default value will be
4347 used. The default value for all keys is 0, with the following exceptions:
4349 - *kernel_code_version_major* defaults to 1.
4350 - *machine_kind* defaults to 1.
4351 - *machine_version_major*, *machine_version_minor*, and
4352 *machine_version_stepping* are derived from the value of the -mcpu option
4353 that is passed to the assembler.
4354 - *kernel_code_entry_byte_offset* defaults to 256.
4355 - *wavefront_size* defaults to 6.
4356 - *kernarg_segment_alignment*, *group_segment_alignment*, and
4357 *private_segment_alignment* default to 4. Note that alignments are specified
4358 as a power of two, so a value of **n** means an alignment of 2^ **n**.
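Since the alignment keys hold an exponent rather than a byte count, a small conversion helper makes the default of 4 easier to read. Both function names below are hypothetical.

```python
def alignment_in_bytes(n: int) -> int:
    """Convert an alignment key value n into the alignment 2**n."""
    if n < 0:
        raise ValueError("alignment exponent must be non-negative")
    return 1 << n


def alignment_key_value(alignment: int) -> int:
    """Convert a power-of-two alignment back into its key value."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return alignment.bit_length() - 1
```

The default key value of 4 therefore corresponds to an alignment of 16.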
4360 The *.amd_kernel_code_t* directive must be placed immediately after the
4361 function label and before any instructions.
4363 For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
4364 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
4366 Here is an example of a minimal amd_kernel_code_t specification:
4368 .. code-block:: none
4370 .hsa_code_object_version 1,0
4371 .hsa_code_object_isa
4376 .amdgpu_hsa_kernel hello_world
4381 enable_sgpr_kernarg_segment_ptr = 1
4383 compute_pgm_rsrc1_vgprs = 0
4384 compute_pgm_rsrc1_sgprs = 0
4385 compute_pgm_rsrc2_user_sgpr = 2
4386 kernarg_segment_byte_size = 8
4387 wavefront_sgpr_count = 2
4388 workitem_vgpr_count = 3
4389 .end_amd_kernel_code_t
4391 s_load_dwordx2 s[0:1], s[0:1], 0x0
4392 v_mov_b32 v0, 3.14159
4393 s_waitcnt lgkmcnt(0)
4396 flat_store_dword v[1:2], v0
4399 .size hello_world, .Lfunc_end0-hello_world
4401 Predefined Symbols (-mattr=+code-object-v3)
4402 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4404 The AMDGPU assembler defines and updates some symbols automatically. These
4405 symbols do not affect code generation.
4407 .amdgcn.gfx_generation_number
4408 +++++++++++++++++++++++++++++
4410 Set to the GFX generation number of the target being assembled for. For
4411 example, when assembling for a "GFX9" target this will be set to the integer
4412 value "9". The possible GFX generation numbers are presented in
4413 :ref:`amdgpu-processors`.
4415 .amdgcn.next_free_vgpr
4416 ++++++++++++++++++++++
4418 Set to zero before assembly begins. At each instruction, if the current value
4419 of this symbol is less than or equal to the maximum VGPR number explicitly
4420 referenced within that instruction then the symbol value is updated to equal
4421 that VGPR number plus one.
4423 May be used to set the ``.amdhsa_next_free_vgpr`` directive in
4424 :ref:`amdhsa-kernel-directives-table`.
4426 May be set at any time, e.g. manually set to zero at the start of each kernel.
4428 .amdgcn.next_free_sgpr
4429 ++++++++++++++++++++++
4431 Set to zero before assembly begins. At each instruction, if the current value
4432 of this symbol is less than or equal to the maximum SGPR number explicitly
4433 referenced within that instruction then the symbol value is updated to equal
4434 that SGPR number plus one.
4436 May be used to set the ``.amdhsa_next_free_sgpr`` directive in
4437 :ref:`amdhsa-kernel-directives-table`.
4439 May be set at any time, e.g. manually set to zero at the start of each kernel.
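The update rule shared by ``.amdgcn.next_free_vgpr`` and ``.amdgcn.next_free_sgpr`` can be sketched as a function applied once per instruction. The function name and the representation of an instruction as a list of referenced register numbers are hypothetical.

```python
def update_next_free(current: int, referenced_registers) -> int:
    """Apply the per-instruction update rule for a next_free symbol.

    `referenced_registers` holds the register numbers explicitly
    referenced by one instruction (VGPRs or SGPRs, depending on the
    symbol being tracked).
    """
    registers = list(referenced_registers)
    if not registers:
        return current
    highest = max(registers)
    # Update only when the symbol is less than or equal to the maximum
    # register number referenced by this instruction.
    if current <= highest:
        return highest + 1
    return current
```

Starting from zero, an instruction that references ``v[1:2]`` moves the VGPR symbol to 3, and a later instruction that only references ``v0`` leaves it unchanged.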
4441 Code Object Directives (-mattr=+code-object-v3)
4442 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4444 Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
4445 architecture processors, and are not OS-specific. Directives which begin with
4446 ``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
4447 ``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
4448 :ref:`amdgpu-processors`.
4450 .amdgcn_target <target>
4451 +++++++++++++++++++++++
4453 Optional directive which declares the target supported by the containing
4454 assembler source file. Valid values are described in
4455 :ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler
4456 to validate command-line options such as ``-triple``, ``-mcpu``, and those
4457 which specify target features.
4459 .amdhsa_kernel <name>
4460 +++++++++++++++++++++
4462 Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
4463 ``<name>.kd``, in the current location of the current section. Only valid when
4464 the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
4465 instruction to execute, and does not need to be previously defined.
4467 Marks the beginning of a list of directives used to generate the bytes of a
4468 kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
4469 Directives which may appear in this list are described in
4470 :ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
4471 be valid for the target being assembled for, and cannot be repeated. Directives
4472 support the range of values specified by the field they reference in
4473 :ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
4474 assumed to have its default value, unless it is marked as "Required", in which
4475 case it is an error to omit the directive. This list of directives is
4476 terminated by an ``.end_amdhsa_kernel`` directive.
4478 .. table:: AMDHSA Kernel Assembler Directives
4479 :name: amdhsa-kernel-directives-table
4481 ======================================================== ================ ============ ===================
4482 Directive Default Supported On Description
4483 ======================================================== ================ ============ ===================
4484 ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX9 Controls GROUP_SEGMENT_FIXED_SIZE in
4485 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4486 ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX9 Controls PRIVATE_SEGMENT_FIXED_SIZE in
4487 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4488 ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX9 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
4489 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4490 ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX9 Controls ENABLE_SGPR_DISPATCH_PTR in
4491 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4492 ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX9 Controls ENABLE_SGPR_QUEUE_PTR in
4493 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4494 ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX9 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
4495 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4496 ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX9 Controls ENABLE_SGPR_DISPATCH_ID in
4497 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4498 ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX9 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
4499 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4500 ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX9 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
4501 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
4502 ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX9 Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in
4503 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4504 ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX9 Controls ENABLE_SGPR_WORKGROUP_ID_X in
4505 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4506 ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX9 Controls ENABLE_SGPR_WORKGROUP_ID_Y in
4507 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4508 ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX9 Controls ENABLE_SGPR_WORKGROUP_ID_Z in
4509 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4510 ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX9 Controls ENABLE_SGPR_WORKGROUP_INFO in
4511 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4512 ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX9 Controls ENABLE_VGPR_WORKITEM_ID in
4513 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4514 Possible values are defined in
4515 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
4516 ``.amdhsa_next_free_vgpr`` Required GFX6-GFX9 Maximum VGPR number explicitly referenced, plus one.
4517 Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
4518 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4519 ``.amdhsa_next_free_sgpr`` Required GFX6-GFX9 Maximum SGPR number explicitly referenced, plus one.
4520 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
4521 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4522 ``.amdhsa_reserve_vcc`` 1 GFX6-GFX9 Whether the kernel may use the special VCC SGPR.
4523 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
4524 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4525 ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX9 Whether the kernel may use flat instructions to access
4526 scratch memory. Used to calculate
4527 GRANULATED_WAVEFRONT_SGPR_COUNT in
4528 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4529 ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX9 Whether the kernel may trigger XNACK replay.
4530 Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
4531 Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4533 ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX9 Controls FLOAT_ROUND_MODE_32 in
4534 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4535 Possible values are defined in
4536 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4537 ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX9 Controls FLOAT_ROUND_MODE_16_64 in
4538 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4539 Possible values are defined in
4540 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
4541 ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX9 Controls FLOAT_DENORM_MODE_32 in
4542 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4543 Possible values are defined in
4544 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4545 ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX9 Controls FLOAT_DENORM_MODE_16_64 in
4546 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4547 Possible values are defined in
4548 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
4549 ``.amdhsa_dx10_clamp`` 1 GFX6-GFX9 Controls ENABLE_DX10_CLAMP in
4550 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4551 ``.amdhsa_ieee_mode`` 1 GFX6-GFX9 Controls ENABLE_IEEE_MODE in
4552 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4553 ``.amdhsa_fp16_overflow`` 0 GFX9 Controls FP16_OVFL in
4554 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
4555 ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
4556 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4557 ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
4558 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4559 ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
4560 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4561 ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
4562 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4563 ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
4564 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4565 ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
4566 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4567 ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX9 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
4568 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
4569 ======================================================== ================ ============ ===================
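Two of the rules stated for the directive list (no directive may be repeated, and directives marked "Required" must be present) can be checked with a short sketch. The function name and the error strings are hypothetical, and the check is not how the LLVM assembler reports these conditions; the required set is taken from the "Default" column of the table above.

```python
# Directives whose "Default" column reads "Required" in the table.
REQUIRED_DIRECTIVES = {
    ".amdhsa_next_free_vgpr",
    ".amdhsa_next_free_sgpr",
}


def check_kernel_directives(directives):
    """Return a list of problems found in a kernel directive list.

    `directives` is the sequence of directive names that appear between
    .amdhsa_kernel and .end_amdhsa_kernel, in any order.
    """
    errors = []
    seen = set()
    for name in directives:
        if name in seen:
            errors.append(f"{name} is repeated")
        seen.add(name)
    for name in sorted(REQUIRED_DIRECTIVES - seen):
        errors.append(f"{name} is required but missing")
    return errors
```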
4571 Example HSA Source Code (-mattr=+code-object-v3)
4572 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4574 Here is an example of a minimal assembly source file, defining one HSA kernel:
4576 .. code-block:: none
4578 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
4583 .type hello_world,@function
4585 s_load_dwordx2 s[0:1], s[0:1], 0x0
4586 v_mov_b32 v0, 3.14159
4587 s_waitcnt lgkmcnt(0)
4590 flat_store_dword v[1:2], v0
4593 .size hello_world, .Lfunc_end0-hello_world
4597 .amdhsa_kernel hello_world
4598 .amdhsa_user_sgpr_kernarg_segment_ptr 1
4599 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
4600 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr

Additional Documentation
========================

.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
.. [CLANG-ATTR] `Attributes in Clang <http://clang.llvm.org/docs/AttributeReference.html>`__