1 =============================
2 User Guide for AMDGPU Backend
3 =============================
11 The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12 R600 family up until the current GCN families. It lives in the
13 ``lib/Target/AMDGPU`` directory.
18 .. _amdgpu-target-triples:
23 Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24 specify the target triple:
26 .. table:: AMDGPU Architectures
27 :name: amdgpu-architecture-table
29 ============ ==============================================================
30 Architecture Description
31 ============ ==============================================================
32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34 ============ ==============================================================
36 .. table:: AMDGPU Vendors
37 :name: amdgpu-vendor-table
39 ============ ==============================================================
41 ============ ==============================================================
42 ``amd`` Can be used for all AMD GPU usage.
43 ``mesa3d`` Can be used if the OS is ``mesa3d``.
44 ============ ==============================================================
46 .. table:: AMDGPU Operating Systems
47 :name: amdgpu-os-table
49 ============== ============================================================
51 ============== ============================================================
52 *<empty>* Defaults to the *unknown* OS.
53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
54 such as AMD's ROCm [AMD-ROCm]_.
55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
59 ============== ============================================================
61 .. table:: AMDGPU Environments
62 :name: amdgpu-environment-table
64 ============ ==============================================================
65 Environment Description
66 ============ ==============================================================
68 ============ ==============================================================
70 .. _amdgpu-processors:
75 Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
76 names from both the *Processor* and *Alternative Processor* can be used.
78 .. table:: AMDGPU Processors
79 :name: amdgpu-processor-table
81 =========== =============== ============ ===== ================= ======= ======================
82 Processor Alternative Target dGPU/ Target ROCm Example
83 Processor Triple APU Features Support Products
84 Architecture Supported
86 =========== =============== ============ ===== ================= ======= ======================
87 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
88 -----------------------------------------------------------------------------------------------
89 ``r600`` ``r600`` dGPU
90 ``r630`` ``r600`` dGPU
91 ``rs880`` ``r600`` dGPU
92 ``rv670`` ``r600`` dGPU
93 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
94 -----------------------------------------------------------------------------------------------
95 ``rv710`` ``r600`` dGPU
96 ``rv730`` ``r600`` dGPU
97 ``rv770`` ``r600`` dGPU
98 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
99 -----------------------------------------------------------------------------------------------
100 ``cedar`` ``r600`` dGPU
101 ``cypress`` ``r600`` dGPU
102 ``juniper`` ``r600`` dGPU
103 ``redwood`` ``r600`` dGPU
104 ``sumo`` ``r600`` dGPU
105 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
106 -----------------------------------------------------------------------------------------------
107 ``barts`` ``r600`` dGPU
108 ``caicos`` ``r600`` dGPU
109 ``cayman`` ``r600`` dGPU
110 ``turks`` ``r600`` dGPU
111 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
112 -----------------------------------------------------------------------------------------------
113 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
114 ``gfx601`` - ``hainan`` ``amdgcn`` dGPU
118 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
119 -----------------------------------------------------------------------------------------------
120 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
130 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
134 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
138 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
139 - ``mullins`` - E1-2200
147 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
152 -----------------------------------------------------------------------------------------------
153 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
159 \ ``amdgcn`` APU - xnack ROCm - A10-8700P
162 \ ``amdgcn`` APU - xnack - A10-9600P
168 \ ``amdgcn`` APU - xnack - E2-9010
171 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
172 - ``tonga`` [off] - FirePro S7100
179 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
180 [off] - Radeon R9 Fury
184 - Radeon Instinct MI8
185 \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
186 [off] - Radeon RX 480
187 - Radeon Instinct MI6
188 \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
190 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
192 **GCN GFX9** [AMD-GCN-GFX9]_
193 -----------------------------------------------------------------------------------------------
194 ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
195 [off] Frontier Edition
200 - Radeon Instinct MI25
201 ``gfx902`` ``amdgcn`` APU - xnack - Ryzen 3 2200G
203 ``gfx904`` ``amdgcn`` dGPU - xnack *TBA*
208 ``gfx906`` ``amdgcn`` dGPU - xnack - Radeon Instinct MI50
209 [off] - Radeon Instinct MI60
210 ``gfx908`` ``amdgcn`` dGPU - xnack *TBA*
214 ``gfx909`` ``amdgcn`` APU - xnack *TBA* (Raven Ridge 2)
219 **GCN GFX10** [AMD-GCN-GFX10]_
220 -----------------------------------------------------------------------------------------------
221 ``gfx1010`` ``amdgcn`` dGPU - xnack *TBA*
230 ``gfx1011`` ``amdgcn`` dGPU - xnack *TBA*
239 ``gfx1012`` ``amdgcn`` dGPU - xnack *TBA*
248 =========== =============== ============ ===== ================= ======= ======================
250 .. _amdgpu-target-features:
255 Target features control how code is generated to support certain
256 processor specific features. Not all target features are supported by
257 all processors. The runtime must ensure that the features supported by
258 the device used to execute the code match the features enabled when
259 generating the code. A mismatch of features may result in incorrect
260 execution, or a reduction in performance.
262 The target features supported by each processor, and the default value
263 used if not specified explicitly, is listed in
264 :ref:`amdgpu-processor-table`.
266 Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
272 Enable the ``xnack`` feature.
274 Disable the ``xnack`` feature.
276 .. table:: AMDGPU Target Features
277 :name: amdgpu-target-feature-table
279 ====================== ==================================================
280 Target Feature Description
281 ====================== ==================================================
282 -m[no-]xnack Enable/disable generating code that has
283 memory clauses that are compatible with
284 having XNACK replay enabled.
286 This is used for demand paging and page
287 migration. If XNACK replay is enabled in
288 the device, then if a page fault occurs
289 the code may execute incorrectly if the
290 ``xnack`` feature is not enabled. Executing
291 code that has the feature enabled on a
292 device that does not have XNACK replay
293 enabled will execute correctly, but may
294 be less performant than code with the
297 -m[no-]sram-ecc Enable/disable generating code that assumes SRAM
298 ECC is enabled/disabled.
300 -m[no-]wavefrontsize64 Control the default wavefront size used when
301 generating code for kernels. When disabled
302 native wavefront size 32 is used, when enabled
303 wavefront size 64 is used.
305 -m[no-]cumode Control the default wavefront execution mode used
306 when generating code for kernels. When disabled
307 native WGP wavefront execution mode is used,
308 when enabled CU wavefront execution mode is used
309 (see :ref:`amdgpu-amdhsa-memory-model`).
310 ====================== ==================================================
312 .. _amdgpu-address-spaces:
317 The AMDGPU backend uses the following address space mappings.
319 The memory space names used in the table, aside from the region memory space, is
320 from the OpenCL standard.
322 LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
324 .. table:: Address Space Mapping
325 :name: amdgpu-address-space-mapping-table
327 ================== =================================
328 LLVM Address Space Memory Space
329 ================== =================================
337 7 Buffer Fat Pointer (experimental)
338 ================== =================================
340 The buffer fat pointer is an experimental address space that is currently
341 unsupported in the backend. It exposes a non-integral pointer that is in future
342 intended to support the modelling of 128-bit buffer descriptors + a 32-bit
343 offset into the buffer descriptor (in total encapsulating a 160-bit 'pointer'),
344 allowing us to use normal LLVM load/store/atomic operations to model the buffer
345 descriptors used heavily in graphics workloads targeting the backend.
347 .. _amdgpu-memory-scopes:
352 This section provides LLVM memory synchronization scopes supported by the AMDGPU
353 backend memory model when the target triple OS is ``amdhsa`` (see
354 :ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
356 The memory model supported is based on the HSA memory model [HSA]_ which is
357 based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
358 relation is transitive over the synchonizes-with relation independent of scope,
359 and synchonizes-with allows the memory scope instances to be inclusive (see
360 table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
362 This is different to the OpenCL [OpenCL]_ memory model which does not have scope
363 inclusion and requires the memory scopes to exactly match. However, this
364 is conservatively correct for OpenCL.
366 .. table:: AMDHSA LLVM Sync Scopes
367 :name: amdgpu-amdhsa-llvm-sync-scopes-table
369 ======================= ===================================================
370 LLVM Sync Scope Description
371 ======================= ===================================================
372 *none* The default: ``system``.
374 Synchronizes with, and participates in modification
375 and seq_cst total orderings with, other operations
376 (except image operations) for all address spaces
377 (except private, or generic that accesses private)
378 provided the other operation's sync scope is:
381 - ``agent`` and executed by a thread on the same
383 - ``workgroup`` and executed by a thread in the
385 - ``wavefront`` and executed by a thread in the
388 ``agent`` Synchronizes with, and participates in modification
389 and seq_cst total orderings with, other operations
390 (except image operations) for all address spaces
391 (except private, or generic that accesses private)
392 provided the other operation's sync scope is:
394 - ``system`` or ``agent`` and executed by a thread
396 - ``workgroup`` and executed by a thread in the
398 - ``wavefront`` and executed by a thread in the
401 ``workgroup`` Synchronizes with, and participates in modification
402 and seq_cst total orderings with, other operations
403 (except image operations) for all address spaces
404 (except private, or generic that accesses private)
405 provided the other operation's sync scope is:
407 - ``system``, ``agent`` or ``workgroup`` and
408 executed by a thread in the same workgroup.
409 - ``wavefront`` and executed by a thread in the
412 ``wavefront`` Synchronizes with, and participates in modification
413 and seq_cst total orderings with, other operations
414 (except image operations) for all address spaces
415 (except private, or generic that accesses private)
416 provided the other operation's sync scope is:
418 - ``system``, ``agent``, ``workgroup`` or
419 ``wavefront`` and executed by a thread in the
422 ``singlethread`` Only synchronizes with, and participates in
423 modification and seq_cst total orderings with,
424 other operations (except image operations) running
425 in the same thread for all address spaces (for
426 example, in signal handlers).
428 ``one-as`` Same as ``system`` but only synchronizes with other
429 operations within the same address space.
431 ``agent-one-as`` Same as ``agent`` but only synchronizes with other
432 operations within the same address space.
434 ``workgroup-one-as`` Same as ``workgroup`` but only synchronizes with
435 other operations within the same address space.
437 ``wavefront-one-as`` Same as ``wavefront`` but only synchronizes with
438 other operations within the same address space.
440 ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with
441 other operations within the same address space.
442 ======================= ===================================================
447 The AMDGPU backend implements the following LLVM IR intrinsics.
449 *This section is WIP.*
452 List AMDGPU intrinsics
457 The AMDGPU backend supports the following LLVM IR attributes.
459 .. table:: AMDGPU LLVM IR Attributes
460 :name: amdgpu-llvm-ir-attributes-table
462 ======================================= ==========================================================
463 LLVM Attribute Description
464 ======================================= ==========================================================
465 "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
466 will be specified when the kernel is dispatched. Generated
467 by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
468 "amdgpu-implicitarg-num-bytes"="n" Number of kernel argument bytes to add to the kernel
469 argument block size for the implicit arguments. This
470 varies by OS and language (for OpenCL see
471 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
472 "amdgpu-num-sgpr"="n" Specifies the number of SGPRs to use. Generated by
473 the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
474 "amdgpu-num-vgpr"="n" Specifies the number of VGPRs to use. Generated by the
475 ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
476 "amdgpu-waves-per-eu"="m,n" Specify the minimum and maximum number of waves per
477 execution unit. Generated by the ``amdgpu_waves_per_eu``
478 CLANG attribute [CLANG-ATTR]_.
479 "amdgpu-ieee" true/false. Specify whether the function expects the IEEE field of the
480 mode register to be set on entry. Overrides the default for
481 the calling convention.
482 "amdgpu-dx10-clamp" true/false. Specify whether the function expects the DX10_CLAMP field of
483 the mode register to be set on entry. Overrides the default
484 for the calling convention.
485 ======================================= ==========================================================
490 The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
491 can be linked by ``lld`` to produce a standard ELF shared code object which can
492 be loaded and executed on an AMDGPU target.
497 The AMDGPU backend uses the following ELF header:
499 .. table:: AMDGPU ELF Header
500 :name: amdgpu-elf-header-table
502 ========================== ===============================
504 ========================== ===============================
505 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
506 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
507 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
508 - ``ELFOSABI_AMDGPU_HSA``
509 - ``ELFOSABI_AMDGPU_PAL``
510 - ``ELFOSABI_AMDGPU_MESA3D``
511 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
512 - ``ELFABIVERSION_AMDGPU_PAL``
513 - ``ELFABIVERSION_AMDGPU_MESA3D``
514 ``e_type`` - ``ET_REL``
516 ``e_machine`` ``EM_AMDGPU``
518 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
519 ========================== ===============================
523 .. table:: AMDGPU ELF Header Enumeration Values
524 :name: amdgpu-elf-header-enumeration-values-table
526 =============================== =====
528 =============================== =====
531 ``ELFOSABI_AMDGPU_HSA`` 64
532 ``ELFOSABI_AMDGPU_PAL`` 65
533 ``ELFOSABI_AMDGPU_MESA3D`` 66
534 ``ELFABIVERSION_AMDGPU_HSA`` 1
535 ``ELFABIVERSION_AMDGPU_PAL`` 0
536 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
537 =============================== =====
539 ``e_ident[EI_CLASS]``
542 * ``ELFCLASS32`` for ``r600`` architecture.
544 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
548 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
550 ``e_ident[EI_OSABI]``
551 One of the following AMD GPU architecture specific OS ABIs
552 (see :ref:`amdgpu-os-table`):
554 * ``ELFOSABI_NONE`` for *unknown* OS.
556 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
558 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
560 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
562 ``e_ident[EI_ABIVERSION]``
563 The ABI version of the AMD GPU architecture specific OS ABI to which the code
566 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
569 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
572 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
576 Can be one of the following values:
580 The type produced by the AMD GPU backend compiler as it is relocatable code
584 The type produced by the linker as it is a shared code object.
586 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
589 The value ``EM_AMDGPU`` is used for the machine for all processors supported
590 by the ``r600`` and ``amdgcn`` architectures (see
591 :ref:`amdgpu-processor-table`). The specific processor is specified in the
592 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
593 :ref:`amdgpu-elf-header-e_flags-table`).
596 The entry point is 0 as the entry points for individual kernels must be
597 selected in order to invoke them through AQL packets.
600 The AMDGPU backend uses the following ELF header flags:
602 .. table:: AMDGPU ELF Header ``e_flags``
603 :name: amdgpu-elf-header-e_flags-table
605 ================================= ========== =============================
606 Name Value Description
607 ================================= ========== =============================
608 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`.
609 -------------------------------------------- -----------------------------
610 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
612 ``EF_AMDGPU_MACH_xxx`` values
614 :ref:`amdgpu-ef-amdgpu-mach-table`.
615 ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
618 contained in the code object.
625 :ref:`amdgpu-target-features`.
626 ``EF_AMDGPU_SRAM_ECC`` 0x00000200 Indicates if the ``sram-ecc``
629 contained in the code object.
636 :ref:`amdgpu-target-features`.
637 ================================= ========== =============================
639 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
640 :name: amdgpu-ef-amdgpu-mach-table
642 ================================= ========== =============================
643 Name Value Description (see
644 :ref:`amdgpu-processor-table`)
645 ================================= ========== =============================
646 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
647 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
648 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
649 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
650 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
651 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
652 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
653 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
654 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
655 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
656 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
657 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
658 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
659 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
660 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
661 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
662 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
663 *reserved* 0x011 - Reserved for ``r600``
664 0x01f architecture processors.
665 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
666 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
667 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
668 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
669 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
670 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
671 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
672 *reserved* 0x027 Reserved.
673 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
674 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
675 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
676 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
677 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
678 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
679 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
680 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
681 ``EF_AMDGPU_MACH_AMDGCN_GFX908`` 0x030 ``gfx908``
682 ``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909``
683 *reserved* 0x032 Reserved.
684 ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033 ``gfx1010``
685 ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034 ``gfx1011``
686 ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035 ``gfx1012``
687 ================================= ========== =============================
692 An AMDGPU target ELF code object has the standard ELF sections which include:
694 .. table:: AMDGPU ELF Sections
695 :name: amdgpu-elf-sections-table
697 ================== ================ =================================
699 ================== ================ =================================
700 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
701 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
702 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
703 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
704 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
705 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
706 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
707 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
708 ``.note`` ``SHT_NOTE`` *none*
709 ``.rela``\ *name* ``SHT_RELA`` *none*
710 ``.rela.dyn`` ``SHT_RELA`` *none*
711 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
712 ``.shstrtab`` ``SHT_STRTAB`` *none*
713 ``.strtab`` ``SHT_STRTAB`` *none*
714 ``.symtab`` ``SHT_SYMTAB`` *none*
715 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
716 ================== ================ =================================
718 These sections have their standard meanings (see [ELF]_) and are only generated
722 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
723 DWARF produced by the AMDGPU backend.
725 ``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
726 The standard sections used by a dynamic loader.
729 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
732 ``.rela``\ *name*, ``.rela.dyn``
733 For relocatable code objects, *name* is the name of the section that the
734 relocation records apply. For example, ``.rela.text`` is the section name for
735 relocation records associated with the ``.text`` section.
737 For linked shared code objects, ``.rela.dyn`` contains all the relocation
738 records from each of the relocatable code object's ``.rela``\ *name* sections.
740 See :ref:`amdgpu-relocation-records` for the relocation records supported by
744 The executable machine code for the kernels and functions they call. Generated
745 as position independent code. See :ref:`amdgpu-code-conventions` for
746 information on conventions used in the isa generation.
748 .. _amdgpu-note-records:
753 The AMDGPU backend code object contains ELF note records in the ``.note``
754 section. The set of generated notes and their semantics depend on the code
755 object version; see :ref:`amdgpu-note-records-v2` and
756 :ref:`amdgpu-note-records-v3`.
758 As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding
759 must be generated after the ``name`` field to ensure the ``desc`` field is 4
760 byte aligned. In addition, minimal zero byte padding must be generated to
761 ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign``
762 field of the ``.note`` section must be at least 4 to indicate at least 8 byte
765 .. _amdgpu-note-records-v2:
767 Code Object V2 Note Records (-mattr=-code-object-v3)
768 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
770 .. warning:: Code Object V2 is not the default code object version emitted by
771 this version of LLVM. For a description of the notes generated with the
772 default configuration (Code Object V3) see :ref:`amdgpu-note-records-v3`.
774 The AMDGPU backend code object uses the following ELF note record in the
775 ``.note`` section when compiling for Code Object V2 (-mattr=-code-object-v3).
777 Additional note records may be present, but any which are not documented here
778 are deprecated and should not be used.
780 .. table:: AMDGPU Code Object V2 ELF Note Records
781 :name: amdgpu-elf-note-records-table-v2
783 ===== ============================== ======================================
784 Name Type Description
785 ===== ============================== ======================================
786 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
787 ===== ============================== ======================================
791 .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values
792 :name: amdgpu-elf-note-record-enumeration-values-table-v2
794 ============================== =====
796 ============================== =====
798 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
800 ============================== =====
802 ``NT_AMD_AMDGPU_HSA_METADATA``
803 Specifies extensible metadata associated with the code objects executed on HSA
804 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
805 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
806 :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code
807 object metadata string.
809 .. _amdgpu-note-records-v3:
811 Code Object V3 Note Records (-mattr=+code-object-v3)
812 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
814 The AMDGPU backend code object uses the following ELF note record in the
815 ``.note`` section when compiling for Code Object V3 (-mattr=+code-object-v3).
817 Additional note records may be present, but any which are not documented here
818 are deprecated and should not be used.
820 .. table:: AMDGPU Code Object V3 ELF Note Records
821 :name: amdgpu-elf-note-records-table-v3
823 ======== ============================== ======================================
824 Name Type Description
825 ======== ============================== ======================================
826 "AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_
828 ======== ============================== ======================================
832 .. table:: AMDGPU Code Object V3 ELF Note Record Enumeration Values
833 :name: amdgpu-elf-note-record-enumeration-values-table-v3
835 ============================== =====
837 ============================== =====
839 ``NT_AMDGPU_METADATA`` 32
840 ============================== =====
842 ``NT_AMDGPU_METADATA``
843 Specifies extensible metadata associated with an AMDGPU code
844 object. It is encoded as a map in the Message Pack [MsgPack]_ binary
845 data format. See :ref:`amdgpu-amdhsa-code-object-metadata-v3` for the
846 map keys defined for the ``amdhsa`` OS.
853 Symbols include the following:
855 .. table:: AMDGPU ELF Symbols
856 :name: amdgpu-elf-symbols-table
858 ===================== ================== ================ ==================
859 Name Type Section Description
860 ===================== ================== ================ ==================
861 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
864 *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
865 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
866 *link-name* ``STT_OBJECT`` - SHN_AMDGPU_LDS Global variable in LDS
867 ===================== ================== ================ ==================
870 Global variables both used and defined by the compilation unit.
872 If the symbol is defined in the compilation unit then it is allocated in the
873 appropriate section according to if it has initialized data or is readonly.
875 If the symbol is external then its section is ``STN_UNDEF`` and the loader
876 will resolve relocations using the definition provided by another code object
877 or explicitly defined by the runtime.
879 If the symbol resides in local/group memory (LDS) then its section is the
880 special processor-specific section name ``SHN_AMDGPU_LDS``, and the
881 ``st_value`` field describes alignment requirements as it does for common
885 Add description of linked shared object symbols. Seems undefined symbols
886 are marked as STT_NOTYPE.
889 Every HSA kernel has an associated kernel descriptor. It is the address of the
890 kernel descriptor that is used in the AQL dispatch packet used to invoke the
891 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
892 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
895 Every HSA kernel also has a symbol for its machine code entry point.
897 .. _amdgpu-relocation-records:
902 AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
903 relocatable fields are:
906 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
907 alignment. These values use the same byte order as other word values in the
908 AMD GPU architecture.
911 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
912 alignment. These values use the same byte order as other word values in the
913 AMD GPU architecture.
915 Following notations are used for specifying relocation calculations:
918 Represents the addend used to compute the value of the relocatable field.
921 Represents the offset into the global offset table at which the relocation
922 entry's symbol will reside during execution.
925 Represents the address of the global offset table.
928 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
929 of the storage unit being relocated (computed using ``r_offset``).
932 Represents the value of the symbol whose index resides in the relocation
933 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
936 Represents the base address of a loaded executable or shared object which is
937 the difference between the ELF address and the actual load address. Relocations
938 using this are only valid in executable or shared objects.
940 The following relocation types are supported:
942 .. table:: AMDGPU ELF Relocation Records
943 :name: amdgpu-elf-relocation-records-table
945 ========================== ======= ===== ========== ==============================
946 Relocation Type Kind Value Field Calculation
947 ========================== ======= ===== ========== ==============================
948 ``R_AMDGPU_NONE`` 0 *none* *none*
949 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF
951 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32
953 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A
955 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
956 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
957 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A
959 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
960 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
961 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
962 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
963 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
965 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
966 ========================== ======= ===== ========== ==============================
968 ``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
969 the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
971 There is no current OS loader support for 32 bit programs and so
972 ``R_AMDGPU_ABS32`` is not used.
979 Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
980 information that maps the code object executable code and data to the source
981 language constructs. It can be used by tools such as debuggers and profilers.
983 Address Space Mapping
984 ~~~~~~~~~~~~~~~~~~~~~
986 The following address space mapping is used:
988 .. table:: AMDGPU DWARF Address Space Mapping
989 :name: amdgpu-dwarf-address-space-mapping-table
991 =================== =================
992 DWARF Address Space Memory Space
993 =================== =================
998 *omitted* Generic (Flat)
999 *not supported* Region (GDS)
1000 =================== =================
1002 See :ref:`amdgpu-address-spaces` for information on the memory space terminology
1005 An ``address_class`` attribute is generated on pointer type DIEs to specify the
1006 DWARF address space of the value of the pointer when it is in the *private* or
1007 *local* address space. Otherwise the attribute is omitted.
1009 An ``XDEREF`` operation is generated in location list expressions for variables
1010 that are allocated in the *private* and *local* address space. Otherwise no
1011 ``XDREF`` is omitted.
1016 *This section is WIP.*
1019 Define DWARF register enumeration.
1021 If want to present a wavefront state then should expose vector registers as
1022 64 wide (rather than per work-item view that LLVM uses). Either as separate
1023 registers, or a 64x4 byte single register. In either case use a new LANE op
1024 (akin to XDREF) to select the current lane usage in a location
1025 expression. This would also allow scalar register spilling to vector register
1026 lanes to be expressed (currently no debug information is being generated for
1027 spilling). If choose a wide single register approach then use LANE in
1028 conjunction with PIECE operation to select the dword part of the register for
1029 the current lane. If the separate register approach then use LANE to select
1035 Source text for online-compiled programs (e.g. those compiled by the OpenCL
1036 runtime) may be embedded into the DWARF v5 line table using the ``clang
1037 -gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
1042 Enable the embedded source DWARF v5 extension.
1043 ``-gno-embed-source``
1044 Disable the embedded source DWARF v5 extension.
1046 .. table:: AMDGPU Debug Options
1047 :name: amdgpu-debug-options
1049 ==================== ==================================================
1050 Debug Flag Description
1051 ==================== ==================================================
1052 -g[no-]embed-source Enable/disable embedding source text in DWARF
1053 debug sections. Useful for environments where
1054 source cannot be written to disk, such as
1055 when performing online compilation.
1056 ==================== ==================================================
1058 This option enables one extended content types in the DWARF v5 Line Number
1059 Program Header, which is used to encode embedded source.
1061 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
1062 :name: amdgpu-dwarf-extended-content-types
1064 ============================ ======================
1066 ============================ ======================
1067 ``DW_LNCT_LLVM_source`` ``DW_FORM_line_strp``
1068 ============================ ======================
1070 The source field will contain the UTF-8 encoded, null-terminated source text
1071 with ``'\n'`` line endings. When the source field is present, consumers can use
1072 the embedded source instead of attempting to discover the source on disk. When
1073 the source field is absent, consumers can access the file to get the source
1076 The above content type appears in the ``file_name_entry_format`` field of the
1077 line table prologue, and its corresponding value appear in the ``file_names``
1078 field. The current encoding of the content type is documented in table
1079 :ref:`amdgpu-dwarf-extended-content-types-encoding`
1081 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
1082 :name: amdgpu-dwarf-extended-content-types-encoding
1084 ============================ ====================
1086 ============================ ====================
1087 ``DW_LNCT_LLVM_source`` 0x2001
1088 ============================ ====================
1090 .. _amdgpu-code-conventions:
1095 This section provides code conventions used for each supported target triple OS
1096 (see :ref:`amdgpu-target-triples`).
1101 This section provides code conventions used when the target triple OS is
1102 ``amdhsa`` (see :ref:`amdgpu-target-triples`).
1104 .. _amdgpu-amdhsa-code-object-target-identification:
1106 Code Object Target Identification
1107 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1109 The AMDHSA OS uses the following syntax to specify the code object
1110 target as a single string:
1112 ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
1116 - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
1117 are the same as the *Target Triple* (see
1118 :ref:`amdgpu-target-triples`).
1120 - ``<Processor>`` is the same as the *Processor* (see
1121 :ref:`amdgpu-processors`).
1123 - ``<Target Features>`` is a list of the enabled *Target Features*
1124 (see :ref:`amdgpu-target-features`), each prefixed by a plus, that
1125 apply to *Processor*. The list must be in the same order as listed
1126 in the table :ref:`amdgpu-target-feature-table`. Note that *Target
1127 Features* must be included in the list if they are enabled even if
1128 that is the default for *Processor*.
1132 ``"amdgcn-amd-amdhsa--gfx902+xnack"``
1134 .. _amdgpu-amdhsa-code-object-metadata:
1136 Code Object Metadata
1137 ~~~~~~~~~~~~~~~~~~~~
1139 The code object metadata specifies extensible metadata associated with the code
1140 objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
1141 [AMD-ROCm]_. The encoding and semantics of this metadata depends on the code
1142 object version; see :ref:`amdgpu-amdhsa-code-object-metadata-v2` and
1143 :ref:`amdgpu-amdhsa-code-object-metadata-v3`.
1145 Code object metadata is specified in a note record (see
1146 :ref:`amdgpu-note-records`) and is required when the target triple OS is
1147 ``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
1148 information necessary to support the ROCM kernel queries. For example, the
1149 segment sizes needed in a dispatch packet. In addition, a high level language
1150 runtime may require other information to be included. For example, the AMD
1151 OpenCL runtime records kernel argument information.
1153 .. _amdgpu-amdhsa-code-object-metadata-v2:
1155 Code Object V2 Metadata (-mattr=-code-object-v3)
1156 ++++++++++++++++++++++++++++++++++++++++++++++++
1158 .. warning:: Code Object V2 is not the default code object version emitted by
1159 this version of LLVM. For a description of the metadata generated with the
1160 default configuration (Code Object V3) see
1161 :ref:`amdgpu-amdhsa-code-object-metadata-v3`.
1163 Code object V2 metadata is specified by the ``NT_AMD_AMDGPU_METADATA`` note
1164 record (see :ref:`amdgpu-note-records-v2`).
1166 The metadata is specified as a YAML formatted string (see [YAML]_ and
1170 Is the string null terminated? It probably should not if YAML allows it to
1171 contain null characters, otherwise it should be.
1173 The metadata is represented as a single YAML document comprised of the mapping
1174 defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v2` and
1177 For boolean values, the string values of ``false`` and ``true`` are used for
1178 false and true respectively.
1180 Additional information can be added to the mappings. To avoid conflicts, any
1181 non-AMD key names should be prefixed by "*vendor-name*.".
1183 .. table:: AMDHSA Code Object V2 Metadata Map
1184 :name: amdgpu-amdhsa-code-object-metadata-map-table-v2
1186 ========== ============== ========= =======================================
1187 String Key Value Type Required? Description
1188 ========== ============== ========= =======================================
1189 "Version" sequence of Required - The first integer is the major
1190 2 integers version. Currently 1.
1191 - The second integer is the minor
1192 version. Currently 0.
1193 "Printf" sequence of Each string is encoded information
1194 strings about a printf function call. The
1195 encoded information is organized as
1196 fields separated by colon (':'):
1198 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
1203 A 32 bit integer as a unique id for
1204 each printf function call
1207 A 32 bit integer equal to the number
1208 of arguments of printf function call
1211 ``S[i]`` (where i = 0, 1, ... , N-1)
1212 32 bit integers for the size in bytes
1213 of the i-th FormatString argument of
1214 the printf function call
1217 The format string passed to the
1218 printf function call.
1219 "Kernels" sequence of Required Sequence of the mappings for each
1220 mapping kernel in the code object. See
1221 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2`
1222 for the definition of the mapping.
1223 ========== ============== ========= =======================================
1227 .. table:: AMDHSA Code Object V2 Kernel Metadata Map
1228 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2
1230 ================= ============== ========= ================================
1231 String Key Value Type Required? Description
1232 ================= ============== ========= ================================
1233 "Name" string Required Source name of the kernel.
1234 "SymbolName" string Required Name of the kernel
1235 descriptor ELF symbol.
1236 "Language" string Source language of the kernel.
1244 "LanguageVersion" sequence of - The first integer is the major
1246 - The second integer is the
1248 "Attrs" mapping Mapping of kernel attributes.
1250 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2`
1251 for the mapping definition.
1252 "Args" sequence of Sequence of mappings of the
1253 mapping kernel arguments. See
1254 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2`
1255 for the definition of the mapping.
1256 "CodeProps" mapping Mapping of properties related to
1257 the kernel code. See
1258 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2`
1259 for the mapping definition.
1260 ================= ============== ========= ================================
1264 .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map
1265 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2
1267 =================== ============== ========= ==============================
1268 String Key Value Type Required? Description
1269 =================== ============== ========= ==============================
1270 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
1271 3 integers must be >=1 and the dispatch
1272 work-group size X, Y, Z must
1273 correspond to the specified
1274 values. Defaults to 0, 0, 0.
1276 Corresponds to the OpenCL
1277 ``reqd_work_group_size``
1279 "WorkGroupSizeHint" sequence of The dispatch work-group size
1280 3 integers X, Y, Z is likely to be the
1283 Corresponds to the OpenCL
1284 ``work_group_size_hint``
1286 "VecTypeHint" string The name of a scalar or vector
1289 Corresponds to the OpenCL
1290 ``vec_type_hint`` attribute.
1292 "RuntimeHandle" string The external symbol name
1293 associated with a kernel.
1294 OpenCL runtime allocates a
1295 global buffer for the symbol
1296 and saves the kernel's address
1297 to it, which is used for
1298 device side enqueueing. Only
1299 available for device side
1301 =================== ============== ========= ==============================
1305 .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map
1306 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2
1308 ================= ============== ========= ================================
1309 String Key Value Type Required? Description
1310 ================= ============== ========= ================================
1311 "Name" string Kernel argument name.
1312 "TypeName" string Kernel argument type name.
1313 "Size" integer Required Kernel argument size in bytes.
1314 "Align" integer Required Kernel argument alignment in
1315 bytes. Must be a power of two.
1316 "ValueKind" string Required Kernel argument kind that
1317 specifies how to set up the
1318 corresponding argument.
1322 The argument is copied
1323 directly into the kernarg.
1326 A global address space pointer
1327 to the buffer data is passed
1330 "DynamicSharedPointer"
1331 A group address space pointer
1332 to dynamically allocated LDS
1333 is passed in the kernarg.
1336 A global address space
1337 pointer to a S# is passed in
1341 A global address space
1342 pointer to a T# is passed in
1346 A global address space pointer
1347 to an OpenCL pipe is passed in
1351 A global address space pointer
1352 to an OpenCL device enqueue
1353 queue is passed in the
1356 "HiddenGlobalOffsetX"
1357 The OpenCL grid dispatch
1358 global offset for the X
1359 dimension is passed in the
1362 "HiddenGlobalOffsetY"
1363 The OpenCL grid dispatch
1364 global offset for the Y
1365 dimension is passed in the
1368 "HiddenGlobalOffsetZ"
1369 The OpenCL grid dispatch
1370 global offset for the Z
1371 dimension is passed in the
1375 An argument that is not used
1376 by the kernel. Space needs to
1377 be left for it, but it does
1378 not need to be set up.
1380 "HiddenPrintfBuffer"
1381 A global address space pointer
1382 to the runtime printf buffer
1383 is passed in kernarg.
1385 "HiddenDefaultQueue"
1386 A global address space pointer
1387 to the OpenCL device enqueue
1388 queue that should be used by
1389 the kernel by default is
1390 passed in the kernarg.
1392 "HiddenCompletionAction"
1393 A global address space pointer
1394 to help link enqueued kernels into
1395 the ancestor tree for determining
1396 when the parent kernel has finished.
1398 "HiddenMultiGridSyncArg"
1399 A global address space pointer for
1400 multi-grid synchronization is
1401 passed in the kernarg.
1403 "ValueType" string Required Kernel argument value type. Only
1404 present if "ValueKind" is
1405 "ByValue". For vector data
1406 types, the value is for the
1407 element type. Values include:
1423 How can it be determined if a
1424 vector type, and what size
1426 "PointeeAlign" integer Alignment in bytes of pointee
1427 type for pointer type kernel
1428 argument. Must be a power
1429 of 2. Only present if
1431 "DynamicSharedPointer".
1432 "AddrSpaceQual" string Kernel argument address space
1433 qualifier. Only present if
1434 "ValueKind" is "GlobalBuffer" or
1435 "DynamicSharedPointer". Values
1446 Is GlobalBuffer only Global
1448 DynamicSharedPointer always
1449 Local? Can HCC allow Generic?
1450 How can Private or Region
1452 "AccQual" string Kernel argument access
1453 qualifier. Only present if
1454 "ValueKind" is "Image" or
1465 "ActualAccQual" string The actual memory accesses
1466 performed by the kernel on the
1467 kernel argument. Only present if
1468 "ValueKind" is "GlobalBuffer",
1469 "Image", or "Pipe". This may be
1470 more restrictive than indicated
1471 by "AccQual" to reflect what the
1472 kernel actual does. If not
1473 present then the runtime must
1474 assume what is implied by
1475 "AccQual" and "IsConst". Values
1482 "IsConst" boolean Indicates if the kernel argument
1483 is const qualified. Only present
1487 "IsRestrict" boolean Indicates if the kernel argument
1488 is restrict qualified. Only
1489 present if "ValueKind" is
1492 "IsVolatile" boolean Indicates if the kernel argument
1493 is volatile qualified. Only
1494 present if "ValueKind" is
1497 "IsPipe" boolean Indicates if the kernel argument
1498 is pipe qualified. Only present
1499 if "ValueKind" is "Pipe".
1502 Can GlobalBuffer be pipe
1504 ================= ============== ========= ================================
1508 .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map
1509 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2
1511 ============================ ============== ========= =====================
1512 String Key Value Type Required? Description
1513 ============================ ============== ========= =====================
1514 "KernargSegmentSize" integer Required The size in bytes of
1516 that holds the values
1519 "GroupSegmentFixedSize" integer Required The amount of group
1523 bytes. This does not
1525 dynamically allocated
1526 group segment memory
1530 "PrivateSegmentFixedSize" integer Required The amount of fixed
1531 private address space
1532 memory required for a
1534 bytes. If the kernel
1536 stack then additional
1538 to this value for the
1540 "KernargSegmentAlign" integer Required The maximum byte
1543 kernarg segment. Must
1545 "WavefrontSize" integer Required Wavefront size. Must
1547 "NumSGPRs" integer Required Number of scalar
1551 includes the special
1553 Scratch (GFX7-GFX10)
1555 GFX8-GFX10). It does
1557 SGPR added if a trap
1563 "NumVGPRs" integer Required Number of vector
1567 "MaxFlatWorkGroupSize" integer Required Maximum flat
1570 kernel in work-items.
1573 ReqdWorkGroupSize if
1575 "NumSpilledSGPRs" integer Number of stores from
1576 a scalar register to
1577 a register allocator
1580 "NumSpilledVGPRs" integer Number of stores from
1581 a vector register to
1582 a register allocator
1585 ============================ ============== ========= =====================
1587 .. _amdgpu-amdhsa-code-object-metadata-v3:
1589 Code Object V3 Metadata (-mattr=+code-object-v3)
1590 ++++++++++++++++++++++++++++++++++++++++++++++++
1592 Code object V3 metadata is specified by the ``NT_AMDGPU_METADATA`` note record
1593 (see :ref:`amdgpu-note-records-v3`).
1595 The metadata is represented as Message Pack formatted binary data (see
1596 [MsgPack]_). The top level is a Message Pack map that includes the
1597 keys defined in table
1598 :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced
1601 Additional information can be added to the maps. To avoid conflicts,
1602 any key names should be prefixed by "*vendor-name*." where
1603 ``vendor-name`` can be the the name of the vendor and specific vendor
1604 tool that generates the information. The prefix is abbreviated to
1605 simply "." when it appears within a map that has been added by the
1608 .. table:: AMDHSA Code Object V3 Metadata Map
1609 :name: amdgpu-amdhsa-code-object-metadata-map-table-v3
1611 ================= ============== ========= =======================================
1612 String Key Value Type Required? Description
1613 ================= ============== ========= =======================================
1614 "amdhsa.version" sequence of Required - The first integer is the major
1615 2 integers version. Currently 1.
1616 - The second integer is the minor
1617 version. Currently 0.
1618 "amdhsa.printf" sequence of Each string is encoded information
1619 strings about a printf function call. The
1620 encoded information is organized as
1621 fields separated by colon (':'):
1623 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
1628 A 32 bit integer as a unique id for
1629 each printf function call
1632 A 32 bit integer equal to the number
1633 of arguments of printf function call
1636 ``S[i]`` (where i = 0, 1, ... , N-1)
1637 32 bit integers for the size in bytes
1638 of the i-th FormatString argument of
1639 the printf function call
1642 The format string passed to the
1643 printf function call.
1644 "amdhsa.kernels" sequence of Required Sequence of the maps for each
1645 map kernel in the code object. See
1646 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3`
1647 for the definition of the keys included
1649 ================= ============== ========= =======================================
1653 .. table:: AMDHSA Code Object V3 Kernel Metadata Map
1654 :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3
1656 =================================== ============== ========= ================================
1657 String Key Value Type Required? Description
1658 =================================== ============== ========= ================================
1659 ".name" string Required Source name of the kernel.
1660 ".symbol" string Required Name of the kernel
1661 descriptor ELF symbol.
1662 ".language" string Source language of the kernel.
1672 ".language_version" sequence of - The first integer is the major
1674 - The second integer is the
1676 ".args" sequence of Sequence of maps of the
1677 map kernel arguments. See
1678 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3`
1679 for the definition of the keys
1680 included in that map.
1681 ".reqd_workgroup_size" sequence of If not 0, 0, 0 then all values
1682 3 integers must be >=1 and the dispatch
1683 work-group size X, Y, Z must
1684 correspond to the specified
1685 values. Defaults to 0, 0, 0.
1687 Corresponds to the OpenCL
1688 ``reqd_work_group_size``
1690 ".workgroup_size_hint" sequence of The dispatch work-group size
1691 3 integers X, Y, Z is likely to be the
1694 Corresponds to the OpenCL
1695 ``work_group_size_hint``
1697 ".vec_type_hint" string The name of a scalar or vector
1700 Corresponds to the OpenCL
1701 ``vec_type_hint`` attribute.
1703 ".device_enqueue_symbol" string The external symbol name
1704 associated with a kernel.
1705 OpenCL runtime allocates a
1706 global buffer for the symbol
1707 and saves the kernel's address
1708 to it, which is used for
1709 device side enqueueing. Only
1710 available for device side
1712 ".kernarg_segment_size" integer Required The size in bytes of
1714 that holds the values
1717 ".group_segment_fixed_size" integer Required The amount of group
1721 bytes. This does not
1723 dynamically allocated
1724 group segment memory
1728 ".private_segment_fixed_size" integer Required The amount of fixed
1729 private address space
1730 memory required for a
1732 bytes. If the kernel
1734 stack then additional
1736 to this value for the
1738 ".kernarg_segment_align" integer Required The maximum byte
1741 kernarg segment. Must
1743 ".wavefront_size" integer Required Wavefront size. Must
1745 ".sgpr_count" integer Required Number of scalar
1746 registers required by a
1748 GFX6-GFX9. A register
1749 is required if it is
1751 if a higher numbered
1754 includes the special
1760 SGPR added if a trap
1766 ".vgpr_count" integer Required Number of vector
1767 registers required by
1769 GFX6-GFX9. A register
1770 is required if it is
1772 if a higher numbered
1775 ".max_flat_workgroup_size" integer Required Maximum flat
1778 kernel in work-items.
1781 ReqdWorkGroupSize if
1783 ".sgpr_spill_count" integer Number of stores from
1784 a scalar register to
1785 a register allocator
1788 ".vgpr_spill_count" integer Number of stores from
1789 a vector register to
1790 a register allocator
1793 =================================== ============== ========= ================================
1797 .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map
1798 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3
1800 ====================== ============== ========= ================================
1801 String Key Value Type Required? Description
1802 ====================== ============== ========= ================================
1803 ".name" string Kernel argument name.
1804 ".type_name" string Kernel argument type name.
1805 ".size" integer Required Kernel argument size in bytes.
1806 ".offset" integer Required Kernel argument offset in
1807 bytes. The offset must be a
1808 multiple of the alignment
1809 required by the argument.
1810 ".value_kind" string Required Kernel argument kind that
1811 specifies how to set up the
1812 corresponding argument.
1816 The argument is copied
1817 directly into the kernarg.
1820 A global address space pointer
1821 to the buffer data is passed
1824 "dynamic_shared_pointer"
1825 A group address space pointer
1826 to dynamically allocated LDS
1827 is passed in the kernarg.
1830 A global address space
1831 pointer to a S# is passed in
1835 A global address space
1836 pointer to a T# is passed in
1840 A global address space pointer
1841 to an OpenCL pipe is passed in
1845 A global address space pointer
1846 to an OpenCL device enqueue
1847 queue is passed in the
1850 "hidden_global_offset_x"
1851 The OpenCL grid dispatch
1852 global offset for the X
1853 dimension is passed in the
1856 "hidden_global_offset_y"
1857 The OpenCL grid dispatch
1858 global offset for the Y
1859 dimension is passed in the
1862 "hidden_global_offset_z"
1863 The OpenCL grid dispatch
1864 global offset for the Z
1865 dimension is passed in the
1869 An argument that is not used
1870 by the kernel. Space needs to
1871 be left for it, but it does
1872 not need to be set up.
1874 "hidden_printf_buffer"
1875 A global address space pointer
1876 to the runtime printf buffer
1877 is passed in kernarg.
1879 "hidden_default_queue"
1880 A global address space pointer
1881 to the OpenCL device enqueue
1882 queue that should be used by
1883 the kernel by default is
1884 passed in the kernarg.
1886 "hidden_completion_action"
1887 A global address space pointer
1888 to help link enqueued kernels into
1889 the ancestor tree for determining
1890 when the parent kernel has finished.
1892 "hidden_multigrid_sync_arg"
1893 A global address space pointer for
1894 multi-grid synchronization is
1895 passed in the kernarg.
1897 ".value_type" string Required Kernel argument value type. Only
1898 present if ".value_kind" is
1899 "by_value". For vector data
1900 types, the value is for the
1901 element type. Values include:
1917 How can it be determined if a
1918 vector type, and what size
1920 ".pointee_align" integer Alignment in bytes of pointee
1921 type for pointer type kernel
1922 argument. Must be a power
1923 of 2. Only present if
1925 "dynamic_shared_pointer".
1926 ".address_space" string Kernel argument address space
1927 qualifier. Only present if
1928 ".value_kind" is "global_buffer" or
1929 "dynamic_shared_pointer". Values
1940 Is "global_buffer" only "global"
1942 "dynamic_shared_pointer" always
1943 "local"? Can HCC allow "generic"?
1944 How can "private" or "region"
1946 ".access" string Kernel argument access
1947 qualifier. Only present if
1948 ".value_kind" is "image" or
1959 ".actual_access" string The actual memory accesses
1960 performed by the kernel on the
1961 kernel argument. Only present if
1962 ".value_kind" is "global_buffer",
1963 "image", or "pipe". This may be
1964 more restrictive than indicated
1965 by ".access" to reflect what the
1966 kernel actual does. If not
1967 present then the runtime must
1968 assume what is implied by
1969 ".access" and ".is_const" . Values
1976 ".is_const" boolean Indicates if the kernel argument
1977 is const qualified. Only present
1981 ".is_restrict" boolean Indicates if the kernel argument
1982 is restrict qualified. Only
1983 present if ".value_kind" is
1986 ".is_volatile" boolean Indicates if the kernel argument
1987 is volatile qualified. Only
1988 present if ".value_kind" is
1991 ".is_pipe" boolean Indicates if the kernel argument
1992 is pipe qualified. Only present
1993 if ".value_kind" is "pipe".
1996 Can "global_buffer" be pipe
1998 ====================== ============== ========= ================================
2005 The HSA architected queuing language (AQL) defines a user space memory interface
2006 that can be used to control the dispatch of kernels, in an agent independent
2007 way. An agent can have zero or more AQL queues created for it using the ROCm
2008 runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
2009 *HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
2010 mechanics and packet layouts.
2012 The packet processor of a kernel agent is responsible for detecting and
2013 dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
2014 packet processor is implemented by the hardware command processor (CP),
2015 asynchronous dispatch controller (ADC) and shader processor input controller
2018 The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
2019 mode driver to initialize and register the AQL queue with CP.
2021 To dispatch a kernel the following actions are performed. This can occur in the
2022 CPU host program, or from an HSA kernel executing on a GPU.
2024 1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
2025 executed is obtained.
2026 2. A pointer to the kernel descriptor (see
2027 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
2028 obtained. It must be for a kernel that is contained in a code object that that
2029 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
2031 3. Space is allocated for the kernel arguments using the ROCm runtime allocator
2032 for a memory region with the kernarg property for the kernel agent that will
2033 execute the kernel. It must be at least 16 byte aligned.
2034 4. Kernel argument values are assigned to the kernel argument memory
2035 allocation. The layout is defined in the *HSA Programmer's Language Reference*
2036 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
2037 memory in the same way constant memory is accessed. (Note that the HSA
2038 specification allows an implementation to copy the kernel argument contents to
2039 another location that is accessed by the kernel.)
2040 5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
2041 api uses 64 bit atomic operations to reserve space in the AQL queue for the
2042 packet. The packet must be set up, and the final write must use an atomic
2043 store release to set the packet kind to ensure the packet contents are
2044 visible to the kernel agent. AQL defines a doorbell signal mechanism to
2045 notify the kernel agent that the AQL queue has been updated. These rules, and
2046 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
2047 System Architecture Specification* [HSA]_.
2048 6. A kernel dispatch packet includes information about the actual dispatch,
2049 such as grid and work-group size, together with information from the code
2050 object about the kernel, such as segment sizes. The ROCm runtime queries on
2051 the kernel symbol can be used to obtain the code object values which are
2052 recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
2053 7. CP executes micro-code and is responsible for detecting and setting up the
2054 GPU to execute the wavefronts of a kernel dispatch.
2055 8. CP ensures that when the a wavefront starts executing the kernel machine
2056 code, the scalar general purpose registers (SGPR) and vector general purpose
2057 registers (VGPR) are set up as required by the machine code. The required
2058 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
2059 register state is defined in
2060 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
2061 9. The prolog of the kernel machine code (see
2062 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
2063 before continuing executing the machine code that corresponds to the kernel.
2064 10. When the kernel dispatch has completed execution, CP signals the completion
2065 signal specified in the kernel dispatch packet if not 0.
2067 .. _amdgpu-amdhsa-memory-spaces:
2072 The memory space properties are:
2074 .. table:: AMDHSA Memory Spaces
2075 :name: amdgpu-amdhsa-memory-spaces-table
2077 ================= =========== ======== ======= ==================
2078 Memory Space Name HSA Segment Hardware Address NULL Value
2080 ================= =========== ======== ======= ==================
2081 Private private scratch 32 0x00000000
2082 Local group LDS 32 0xFFFFFFFF
2083 Global global global 64 0x0000000000000000
2084 Constant constant *same as 64 0x0000000000000000
2086 Generic flat flat 64 0x0000000000000000
2087 Region N/A GDS 32 *not implemented
2089 ================= =========== ======== ======= ==================
2091 The global and constant memory spaces both use global virtual addresses, which
2092 are the same virtual address space used by the CPU. However, some virtual
2093 addresses may only be accessible to the CPU, some only accessible by the GPU,
2096 Using the constant memory space indicates that the data will not change during
2097 the execution of the kernel. This allows scalar read instructions to be
2098 used. The vector and scalar L1 caches are invalidated of volatile data before
2099 each kernel dispatch execution to allow constant memory to change values between
2102 The local memory space uses the hardware Local Data Store (LDS) which is
2103 automatically allocated when the hardware creates work-groups of wavefronts, and
2104 freed when all the wavefronts of a work-group have terminated. The data store
2105 (DS) instructions can be used to access it.
2107 The private memory space uses the hardware scratch memory support. If the kernel
2108 uses scratch, then the hardware allocates memory that is accessed using
2109 wavefront lane dword (4 byte) interleaving. The mapping used from private
2110 address to physical address is:
2112 ``wavefront-scratch-base +
2113 (private-address * wavefront-size * 4) +
2114 (wavefront-lane-id * 4)``
2116 There are different ways that the wavefront scratch base address is determined
2117 by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
2118 memory can be accessed in an interleaved manner using buffer instruction with
2119 the scratch buffer descriptor and per wavefront scratch offset, by the scratch
2120 instructions, or by flat instructions. If each lane of a wavefront accesses the
2121 same private address, the interleaving results in adjacent dwords being accessed
2122 and hence requires fewer cache lines to be fetched. Multi-dword access is not
2123 supported except by flat and scratch instructions in GFX9-GFX10.
2125 The generic address space uses the hardware flat address support available in
2126 GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and
2127 local appertures), that are outside the range of addressible global memory, to
2128 map from a flat address to a private or local address.
2130 FLAT instructions can take a flat address and access global, private (scratch)
2131 and group (LDS) memory depending in if the address is within one of the
2132 apperture ranges. Flat access to scratch requires hardware aperture setup and
2133 setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
2134 access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
2135 (see :ref:`amdgpu-amdhsa-m0`).
2137 To convert between a segment address and a flat address the base address of the
2138 appertures address can be used. For GFX7-GFX8 these are available in the
2139 :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
2140 Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
2141 GFX9-GFX10 the appature base addresses are directly available as inline constant
2142 registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
2143 address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
2144 which makes it easier to convert from flat to segment or segment to flat.
2149 Image and sample handles created by the ROCm runtime are 64 bit addresses of a
2150 hardware 32 byte V# and 48 byte S# object respectively. In order to support the
2151 HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
2152 enumeration values for the queries that are not trivially deducible from the S#
2158 HSA signal handles created by the ROCm runtime are 64 bit addresses of a
2159 structure allocated in memory accessible from both the CPU and GPU. The
2160 structure is defined by the ROCm runtime and subject to change between releases
2161 (see [AMD-ROCm-github]_).
2163 .. _amdgpu-amdhsa-hsa-aql-queue:
2168 The HSA AQL queue structure is defined by the ROCm runtime and subject to change
2169 between releases (see [AMD-ROCm-github]_). For some processors it contains
2170 fields needed to implement certain language features such as the flat address
2171 aperture bases. It also contains fields used by CP such as managing the
2172 allocation of scratch memory.
2174 .. _amdgpu-amdhsa-kernel-descriptor:
2179 A kernel descriptor consists of the information needed by CP to initiate the
2180 execution of a kernel, including the entry point address of the machine code
2181 that implements the kernel.
2183 Kernel Descriptor for GFX6-GFX10
2184 ++++++++++++++++++++++++++++++++
2186 CP microcode requires the Kernel descriptor to be allocated on 64 byte
2189 .. table:: Kernel Descriptor for GFX6-GFX10
2190 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table
2192 ======= ======= =============================== ============================
2193 Bits Size Field Name Description
2194 ======= ======= =============================== ============================
2195 31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local
2196 address space memory
2197 required for a work-group
2198 in bytes. This does not
2199 include any dynamically
2200 allocated local address
2201 space memory that may be
2202 added when the kernel is
2204 63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed
2205 private address space
2206 memory required for a
2207 work-item in bytes. If
2208 is_dynamic_callstack is 1
2209 then additional space must
2210 be added to this value for
2212 127:64 8 bytes Reserved, must be 0.
2213 191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly
2216 descriptor to kernel's
2217 entry point instruction
2218 which must be 256 byte
2220 351:272 20 Reserved, must be 0.
2222 383:352 4 bytes COMPUTE_PGM_RSRC3 GFX6-9
2223 Reserved, must be 0.
2226 program settings used by
2228 ``COMPUTE_PGM_RSRC3``
2231 :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`.
2232 415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS)
2233 program settings used by
2235 ``COMPUTE_PGM_RSRC1``
2238 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
2239 447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS)
2240 program settings used by
2242 ``COMPUTE_PGM_RSRC2``
2245 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
2246 448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
2247 _BUFFER SGPR user data registers
2249 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2251 The total number of SGPR
2253 requested must not exceed
2254 16 and match value in
2255 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
2256 Any requests beyond 16
2258 449 1 bit ENABLE_SGPR_DISPATCH_PTR *see above*
2259 450 1 bit ENABLE_SGPR_QUEUE_PTR *see above*
2260 451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR *see above*
2261 452 1 bit ENABLE_SGPR_DISPATCH_ID *see above*
2262 453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT *see above*
2263 454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT *see above*
2265 457:455 3 bits Reserved, must be 0.
2266 458 1 bit ENABLE_WAVEFRONT_SIZE32 GFX6-9
2267 Reserved, must be 0.
2270 wavefront size 64 mode.
2272 native wavefront size
2274 463:459 5 bits Reserved, must be 0.
2275 511:464 6 bytes Reserved, must be 0.
2276 512 **Total size 64 bytes.**
2277 ======= ====================================================================
2281 .. table:: compute_pgm_rsrc1 for GFX6-GFX10
2282 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table
2284 ======= ======= =============================== ===========================================================================
2285 Bits Size Field Name Description
2286 ======= ======= =============================== ===========================================================================
2287 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register
2288 blocks used by each work-item;
2289 granularity is device
2294 - max(0, ceil(vgprs_used / 4) - 1)
2295 GFX10 (wavefront size 64)
2297 - max(0, ceil(vgprs_used / 4) - 1)
2298 GFX10 (wavefront size 32)
2300 - max(0, ceil(vgprs_used / 8) - 1)
2302 Where vgprs_used is defined
2303 as the highest VGPR number
2304 explicitly referenced plus
2307 Used by CP to set up
2308 ``COMPUTE_PGM_RSRC1.VGPRS``.
2311 :ref:`amdgpu-assembler`
2313 automatically for the
2314 selected processor from
2315 values provided to the
2316 `.amdhsa_kernel` directive
2318 `.amdhsa_next_free_vgpr`
2319 nested directive (see
2320 :ref:`amdhsa-kernel-directives-table`).
2321 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
2322 blocks used by a wavefront;
2323 granularity is device
2328 - max(0, ceil(sgprs_used / 8) - 1)
2331 - 2 * max(0, ceil(sgprs_used / 16) - 1)
2333 Reserved, must be 0.
2338 defined as the highest
2339 SGPR number explicitly
2340 referenced plus one, plus
2341 a target-specific number
2342 of additional special
2344 FLAT_SCRATCH (GFX7+) and
2345 XNACK_MASK (GFX8+), and
2348 limitations. It does not
2349 include the 16 SGPRs added
2350 if a trap handler is
2354 limitations and special
2355 SGPR layout are defined in
2357 documentation, which can
2359 :ref:`amdgpu-processors`
2362 Used by CP to set up
2363 ``COMPUTE_PGM_RSRC1.SGPRS``.
2366 :ref:`amdgpu-assembler`
2368 automatically for the
2369 selected processor from
2370 values provided to the
2371 `.amdhsa_kernel` directive
2373 `.amdhsa_next_free_sgpr`
2374 and `.amdhsa_reserve_*`
2375 nested directives (see
2376 :ref:`amdhsa-kernel-directives-table`).
2377 11:10 2 bits PRIORITY Must be 0.
2379 Start executing wavefront
2380 at the specified priority.
2382 CP is responsible for
2384 ``COMPUTE_PGM_RSRC1.PRIORITY``.
2385 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
2386 with specified rounding
2389 precision floating point
2392 Floating point rounding
2393 mode values are defined in
2394 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
2396 Used by CP to set up
2397 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
2398 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
2399 with specified rounding
2400 denorm mode for half/double (16
2401 and 64 bit) floating point
2402 precision floating point
2405 Floating point rounding
2406 mode values are defined in
2407 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
2409 Used by CP to set up
2410 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
2411 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
2412 with specified denorm mode
2415 precision floating point
2418 Floating point denorm mode
2419 values are defined in
2420 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
2422 Used by CP to set up
2423 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
2424 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
2425 with specified denorm mode
2427 and 64 bit) floating point
2428 precision floating point
2431 Floating point denorm mode
2432 values are defined in
2433 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
2435 Used by CP to set up
2436 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
2437 20 1 bit PRIV Must be 0.
2439 Start executing wavefront
2440 in privilege trap handler
2443 CP is responsible for
2445 ``COMPUTE_PGM_RSRC1.PRIV``.
2446 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
2447 with DX10 clamp mode
2448 enabled. Used by the vector
2449 ALU to force DX10 style
2450 treatment of NaN's (when
2451 set, clamp NaN to zero,
2455 Used by CP to set up
2456 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
2457 22 1 bit DEBUG_MODE Must be 0.
2459 Start executing wavefront
2460 in single step mode.
2462 CP is responsible for
2464 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
2465 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
2467 enabled. Floating point
2468 opcodes that support
2469 exception flag gathering
2470 will quiet and propagate
2471 signaling-NaN inputs per
2472 IEEE 754-2008. Min_dx10 and
2473 max_dx10 become IEEE
2474 754-2008 compliant due to
2475 signaling-NaN propagation
2478 Used by CP to set up
2479 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
2480 24 1 bit BULKY Must be 0.
2482 Only one work-group allowed
2483 to execute on a compute
2486 CP is responsible for
2488 ``COMPUTE_PGM_RSRC1.BULKY``.
2489 25 1 bit CDBG_USER Must be 0.
2491 Flag that can be used to
2492 control debugging code.
2494 CP is responsible for
2496 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
2497 26 1 bit FP16_OVFL GFX6-GFX8
2498 Reserved, must be 0.
2500 Wavefront starts execution
2501 with specified fp16 overflow
2504 - If 0, fp16 overflow generates
2506 - If 1, fp16 overflow that is the
2507 result of an +/-INF input value
2508 or divide by 0 produces a +/-INF,
2509 otherwise clamps computed
2510 overflow to +/-MAX_FP16 as
2513 Used by CP to set up
2514 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
2515 28:27 2 bits Reserved, must be 0.
2516 29 1 bit WGP_MODE GFX6-GFX9
2517 Reserved, must be 0.
2519 - If 0 execute work-groups in
2520 CU wavefront execution mode.
2521 - If 1 execute work-groups on
2522 in WGP wavefront execution mode.
2524 See :ref:`amdgpu-amdhsa-memory-model`.
2526 Used by CP to set up
2527 ``COMPUTE_PGM_RSRC1.WGP_MODE``.
2528 30 1 bit MEM_ORDERED GFX6-9
2529 Reserved, must be 0.
2531 Controls the behavior of the
2532 waitcnt's vmcnt and vscnt
2535 - If 0 vmcnt reports completion
2536 of load and atomic with return
2537 out of order with sample
2538 instructions, and the vscnt
2539 reports the completion of
2540 store and atomic without
2542 - If 1 vmcnt reports completion
2543 of load, atomic with return
2544 and sample instructions in
2545 order, and the vscnt reports
2546 the completion of store and
2547 atomic without return in order.
2549 Used by CP to set up
2550 ``COMPUTE_PGM_RSRC1.MEM_ORDERED``.
2551 31 1 bit FWD_PROGRESS GFX6-9
2552 Reserved, must be 0.
2554 - If 0 execute SIMD wavefronts
2555 using oldest first policy.
2556 - If 1 execute SIMD wavefronts to
2557 ensure wavefronts will make some
2560 Used by CP to set up
2561 ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``.
2562 32 **Total size 4 bytes**
2563 ======= ===================================================================================================================
2567 .. table:: compute_pgm_rsrc2 for GFX6-GFX10
2568 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table
2570 ======= ======= =============================== ===========================================================================
2571 Bits Size Field Name Description
2572 ======= ======= =============================== ===========================================================================
2573 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
2574 _WAVEFRONT_OFFSET SGPR wavefront scratch offset
2575 system register (see
2576 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2578 Used by CP to set up
2579 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
2580 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
2582 requested. This number must
2583 match the number of user
2584 data registers enabled.
2586 Used by CP to set up
2587 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
2588 6 1 bit ENABLE_TRAP_HANDLER Must be 0.
2591 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
2592 which is set by the CP if
2593 the runtime has installed a
2595 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
2596 system SGPR register for
2597 the work-group id in the X
2599 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2601 Used by CP to set up
2602 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
2603 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
2604 system SGPR register for
2605 the work-group id in the Y
2607 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2609 Used by CP to set up
2610 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
2611 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
2612 system SGPR register for
2613 the work-group id in the Z
2615 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2617 Used by CP to set up
2618 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
2619 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
2620 system SGPR register for
2621 work-group information (see
2622 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
2624 Used by CP to set up
2625 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
2626 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
2627 VGPR system registers used
2628 for the work-item ID.
2629 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
2632 Used by CP to set up
2633 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
2634 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
2636 Wavefront starts execution
2638 exceptions enabled which
2639 are generated when L1 has
2640 witnessed a thread access
2644 CP is responsible for
2645 filling in the address
2647 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
2648 according to what the
2650 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
2652 Wavefront starts execution
2653 with memory violation
2654 exceptions exceptions
2655 enabled which are generated
2656 when a memory violation has
2657 occurred for this wavefront from
2659 (write-to-read-only-memory,
2660 mis-aligned atomic, LDS
2661 address out of range,
2662 illegal address, etc.).
2666 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
2667 according to what the
2669 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
2671 CP uses the rounded value
2672 from the dispatch packet,
2673 not this value, as the
2674 dispatch may contain
2675 dynamically allocated group
2676 segment memory. CP writes
2678 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
2680 Amount of group segment
2681 (LDS) to allocate for each
2682 work-group. Granularity is
2686 roundup(lds-size / (64 * 4))
2688 roundup(lds-size / (128 * 4))
2690 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
2691 _INVALID_OPERATION with specified exceptions
2694 Used by CP to set up
2695 ``COMPUTE_PGM_RSRC2.EXCP_EN``
2696 (set from bits 0..6).
2700 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
2701 _SOURCE input operands is a
2703 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
2704 _DIVISION_BY_ZERO Zero
2705 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
2707 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
2709 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
2711 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
2712 _ZERO (rcp_iflag_f32 instruction
2714 31 1 bit Reserved, must be 0.
2715 32 **Total size 4 bytes.**
2716 ======= ===================================================================================================================
2720 .. table:: compute_pgm_rsrc3 for GFX10
2721 :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table
2723 ======= ======= =============================== ===========================================================================
2724 Bits Size Field Name Description
2725 ======= ======= =============================== ===========================================================================
2726 3:0 4 bits SHARED_VGPR_COUNT Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120.
2727 compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64.
2728 31:4 28 Reserved, must be 0.
2730 32 **Total size 4 bytes.**
2731 ======= ===================================================================================================================
2735 .. table:: Floating Point Rounding Mode Enumeration Values
2736 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
2738 ====================================== ===== ==============================
2739 Enumeration Name Value Description
2740 ====================================== ===== ==============================
2741 FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
2742 FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
2743 FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
2744 FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
2745 ====================================== ===== ==============================
2749 .. table:: Floating Point Denorm Mode Enumeration Values
2750 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
2752 ====================================== ===== ==============================
2753 Enumeration Name Value Description
2754 ====================================== ===== ==============================
2755 FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
2757 FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
2758 FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
2759 FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
2760 ====================================== ===== ==============================
2764 .. table:: System VGPR Work-Item ID Enumeration Values
2765 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
2767 ======================================== ===== ============================
2768 Enumeration Name Value Description
2769 ======================================== ===== ============================
2770 SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
2772 SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
2774 SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
2776 SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
2777 ======================================== ===== ============================
2779 .. _amdgpu-amdhsa-initial-kernel-execution-state:
2781 Initial Kernel Execution State
2782 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2784 This section defines the register state that will be set up by the packet
2785 processor prior to the start of execution of every wavefront. This is limited by
2786 the constraints of the hardware controllers of CP/ADC/SPI.
2788 The order of the SGPR registers is defined, but the compiler can specify which
2789 ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
2790 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2791 for enabled registers are dense starting at SGPR0: the first enabled register is
2792 SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2795 The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
2796 all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
2797 the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2798 initialized. These are then immediately followed by the System SGPRs that are
2799 set up by ADC/SPI and can have different values for each wavefront of the grid
2802 SGPR register initial state is defined in
2803 :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2805 .. table:: SGPR Register Set Up Order
2806 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2808 ========== ========================== ====== ==============================
2809 SGPR Order Name Number Description
2810 (kernel descriptor enable of
2812 ========== ========================== ====== ==============================
2813 First Private Segment Buffer 4 V# that can be used, together
2814 (enable_sgpr_private with Scratch Wavefront Offset
2815 _segment_buffer) as an offset, to access the
2816 private memory space using a
2819 CP uses the value provided by
2821 then Dispatch Ptr 2 64 bit address of AQL dispatch
2822 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
2824 then Queue Ptr 2 64 bit address of amd_queue_t
2825 (enable_sgpr_queue_ptr) object for AQL queue on which
2826 the dispatch packet was
2828 then Kernarg Segment Ptr 2 64 bit address of Kernarg
2829 (enable_sgpr_kernarg segment. This is directly
2830 _segment_ptr) copied from the
2831 kernarg_address in the kernel
2834 Having CP load it once avoids
2835 loading it at the beginning of
2837 then Dispatch Id 2 64 bit Dispatch ID of the
2838 (enable_sgpr_dispatch_id) dispatch packet being
2840 then Flat Scratch Init 2 This is 2 SGPRs:
2841 (enable_sgpr_flat_scratch
2845 The first SGPR is a 32 bit
2847 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2848 to per SPI base of memory
2849 for scratch for the queue
2850 executing the kernel
2851 dispatch. CP obtains this
2852 from the runtime. (The
2853 Scratch Segment Buffer base
2855 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2856 plus this offset.) The value
2857 of Scratch Wavefront Offset must
2858 be added to this offset by
2859 the kernel machine code,
2860 right shifted by 8, and
2861 moved to the FLAT_SCRATCH_HI
2863 FLAT_SCRATCH_HI corresponds
2864 to SGPRn-4 on GFX7, and
2865 SGPRn-6 on GFX8 (where SGPRn
2866 is the highest numbered SGPR
2867 allocated to the wavefront).
2869 multiplied by 256 (as it is
2870 in units of 256 bytes) and
2872 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2873 to calculate the per wavefront
2874 FLAT SCRATCH BASE in flat
2875 memory instructions that
2879 The second SGPR is 32 bit
2880 byte size of a single
2881 work-item's scratch memory
2882 usage. CP obtains this from
2883 the runtime, and it is
2884 always a multiple of DWORD.
2885 CP checks that the value in
2886 the kernel dispatch packet
2887 Private Segment Byte Size is
2888 not larger, and requests the
2889 runtime to increase the
2890 queue's scratch size if
2891 necessary. The kernel code
2893 FLAT_SCRATCH_LO which is
2894 SGPRn-3 on GFX7 and SGPRn-5
2895 on GFX8. FLAT_SCRATCH_LO is
2896 used as the FLAT SCRATCH
2898 instructions. Having CP load
2899 it once avoids loading it at
2900 the beginning of every
2904 64 bit base address of the
2905 per SPI scratch backing
2906 memory managed by SPI for
2907 the queue executing the
2908 kernel dispatch. CP obtains
2909 this from the runtime (and
2910 divides it if there are
2911 multiple Shader Arrays each
2912 with its own SPI). The value
2913 of Scratch Wavefront Offset must
2914 be added by the kernel
2915 machine code and the result
2916 moved to the FLAT_SCRATCH
2917 SGPR which is SGPRn-6 and
2918 SGPRn-5. It is used as the
2919 FLAT SCRATCH BASE in flat
2920 memory instructions.
2921 then Private Segment Size 1 The 32 bit byte size of a
2922 (enable_sgpr_private single
2924 scratch_segment_size) memory
2925 allocation. This is the
2926 value from the kernel
2927 dispatch packet Private
2928 Segment Byte Size rounded up
2929 by CP to a multiple of
2932 Having CP load it once avoids
2933 loading it at the beginning of
2936 This is not used for
2937 GFX7-GFX8 since it is the same
2938 value as the second SGPR of
2939 Flat Scratch Init. However, it
2940 may be needed for GFX9-GFX10 which
2941 changes the meaning of the
2942 Flat Scratch Init value.
2943 then Grid Work-Group Count X 1 32 bit count of the number of
2944 (enable_sgpr_grid work-groups in the X dimension
2945 _workgroup_count_X) for the grid being
2946 executed. Computed from the
2947 fields in the kernel dispatch
2948 packet as ((grid_size.x +
2949 workgroup_size.x - 1) /
2951 then Grid Work-Group Count Y 1 32 bit count of the number of
2952 (enable_sgpr_grid work-groups in the Y dimension
2953 _workgroup_count_Y && for the grid being
2954 less than 16 previous executed. Computed from the
2955 SGPRs) fields in the kernel dispatch
2956 packet as ((grid_size.y +
2957 workgroup_size.y - 1) /
2960 Only initialized if <16
2961 previous SGPRs initialized.
2962 then Grid Work-Group Count Z 1 32 bit count of the number of
2963 (enable_sgpr_grid work-groups in the Z dimension
2964 _workgroup_count_Z && for the grid being
2965 less than 16 previous executed. Computed from the
2966 SGPRs) fields in the kernel dispatch
2967 packet as ((grid_size.z +
2968 workgroup_size.z - 1) /
2971 Only initialized if <16
2972 previous SGPRs initialized.
2973 then Work-Group Id X 1 32 bit work-group id in X
2974 (enable_sgpr_workgroup_id dimension of grid for
2976 then Work-Group Id Y 1 32 bit work-group id in Y
2977 (enable_sgpr_workgroup_id dimension of grid for
2979 then Work-Group Id Z 1 32 bit work-group id in Z
2980 (enable_sgpr_workgroup_id dimension of grid for
2982 then Work-Group Info 1 {first_wavefront, 14'b0000,
2983 (enable_sgpr_workgroup ordered_append_term[10:0],
2984 _info) threadgroup_size_in_wavefronts[5:0]}
2985 then Scratch Wavefront Offset 1 32 bit byte offset from base
2986 (enable_sgpr_private of scratch base of queue
2987 _segment_wavefront_offset) executing the kernel
2988 dispatch. Must be used as an
2990 segment address when using
2991 Scratch Segment Buffer. It
2992 must be used to set up FLAT
2993 SCRATCH for flat addressing
2995 :ref:`amdgpu-amdhsa-flat-scratch`).
2996 ========== ========================== ====== ==============================
2998 The order of the VGPR registers is defined, but the compiler can specify which
2999 ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
3000 fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
3001 for enabled registers are dense starting at VGPR0: the first enabled register is
3002 VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
3005 VGPR register initial state is defined in
3006 :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
3008 .. table:: VGPR Register Set Up Order
3009 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
3011 ========== ========================== ====== ==============================
3012 VGPR Order Name Number Description
3013 (kernel descriptor enable of
3015 ========== ========================== ====== ==============================
3016 First Work-Item Id X 1 32 bit work item id in X
3017 (Always initialized) dimension of work-group for
3019 then Work-Item Id Y 1 32 bit work item id in Y
3020 (enable_vgpr_workitem_id dimension of work-group for
3021 > 0) wavefront lane.
3022 then Work-Item Id Z 1 32 bit work item id in Z
3023 (enable_vgpr_workitem_id dimension of work-group for
3024 > 1) wavefront lane.
3025 ========== ========================== ====== ==============================
3027 The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
3029 1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
3031 2. Work-group Id registers X, Y, Z are set by ADC which supports any
3032 combination including none.
3033 3. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
3034 its value cannot included with the flat scratch init value which is per queue.
3035 4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
3038 Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
3039 value to the hardware required SGPRn-3 and SGPRn-4 respectively.
3041 The global segment can be accessed either using buffer instructions (GFX6 which
3042 has V# 64 bit address support), flat instructions (GFX7-GFX10), or global
3043 instructions (GFX9-GFX10).
3045 If buffer operations are used then the compiler can generate a V# with the
3046 following properties:
3050 * ATC: 1 if IOMMU present (such as APU)
3052 * MTYPE set to support memory coherence that matches the runtime (such as CC for
3053 APU and NC for dGPU).
3055 .. _amdgpu-amdhsa-kernel-prolog:
3060 .. _amdgpu-amdhsa-m0:
3066 The M0 register must be initialized with a value at least the total LDS size
3067 if the kernel may access LDS via DS or flat operations. Total LDS size is
3068 available in dispatch packet. For M0, it is also possible to use maximum
3069 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
3072 The M0 register is not used for range checking LDS accesses and so does not
3073 need to be initialized in the prolog.
3075 .. _amdgpu-amdhsa-flat-scratch:
3080 If the kernel may use flat operations to access scratch memory, the prolog code
3081 must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
3082 are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
3083 Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
3086 Flat scratch is not supported.
3089 1. The low word of Flat Scratch Init is 32 bit byte offset from
3090 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
3091 being managed by SPI for the queue executing the kernel dispatch. This is
3092 the same value used in the Scratch Segment Buffer V# base address. The
3093 prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
3094 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
3095 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
3096 by 8 before moving into FLAT_SCRATCH_LO.
3097 2. The second word of Flat Scratch Init is 32 bit byte size of a single
3098 work-items scratch memory usage. This is directly loaded from the kernel
3099 dispatch packet Private Segment Byte Size and rounded up to a multiple of
3100 DWORD. Having CP load it once avoids loading it at the beginning of every
3101 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
3105 The Flat Scratch Init is the 64 bit address of the base of scratch backing
3106 memory being managed by SPI for the queue executing the kernel dispatch. The
3107 prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
3108 pair for use as the flat scratch base in flat memory instructions.
3110 .. _amdgpu-amdhsa-memory-model:
3115 This section describes the mapping of LLVM memory model onto AMDGPU machine code
3116 (see :ref:`memmodel`). *The implementation is WIP.*
3119 Update when implementation complete.
3121 The AMDGPU backend supports the memory synchronization scopes specified in
3122 :ref:`amdgpu-memory-scopes`.
3124 The code sequences used to implement the memory model are defined in table
3125 :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table`.
3127 The sequences specify the order of instructions that a single thread must
3128 execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
3129 to other memory instructions executed by the same thread. This allows them to be
3130 moved earlier or later which can allow them to be combined with other instances
3131 of the same instruction, or hoisted/sunk out of loops to improve
3132 performance. Only the instructions related to the memory model are given;
3133 additional ``s_waitcnt`` instructions are required to ensure registers are
3134 defined before being used. These may be able to be combined with the memory
3135 model ``s_waitcnt`` instructions as described above.
3137 The AMDGPU backend supports the following memory models:
3139 HSA Memory Model [HSA]_
3140 The HSA memory model uses a single happens-before relation for all address
3141 spaces (see :ref:`amdgpu-address-spaces`).
3142 OpenCL Memory Model [OpenCL]_
3143 The OpenCL memory model which has separate happens-before relations for the
3144 global and local address spaces. Only a fence specifying both global and
3145 local address space, and seq_cst instructions join the relationships. Since
3146 the LLVM ``memfence`` instruction does not allow an address space to be
3147 specified the OpenCL fence has to convervatively assume both local and
3148 global address space was specified. However, optimizations can often be
3149 done to eliminate the additional ``s_waitcnt`` instructions when there are
3150 no intervening memory instructions which access the corresponding address
3151 space. The code sequences in the table indicate what can be omitted for the
3152 OpenCL memory. The target triple environment is used to determine if the
3153 source language is OpenCL (see :ref:`amdgpu-opencl`).
3155 ``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
3158 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
3159 termed vector memory operations.
3163 * Each agent has multiple shader arrays (SA).
3164 * Each SA has multiple compute units (CU).
3165 * Each CU has multiple SIMDs that execute wavefronts.
3166 * The wavefronts for a single work-group are executed in the same CU but may be
3167 executed by different SIMDs.
3168 * Each CU has a single LDS memory shared by the wavefronts of the work-groups
3170 * All LDS operations of a CU are performed as wavefront wide operations in a
3171 global order and involve no caching. Completion is reported to a wavefront in
3173 * The LDS memory has multiple request queues shared by the SIMDs of a
3174 CU. Therefore, the LDS operations performed by different wavefronts of a work-group
3175 can be reordered relative to each other, which can result in reordering the
3176 visibility of vector memory operations with respect to LDS operations of other
3177 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
3178 ensure synchronization between LDS operations and vector memory operations
3179 between wavefronts of a work-group, but not between operations performed by the
3181 * The vector memory operations are performed as wavefront wide operations and
3182 completion is reported to a wavefront in execution order. The exception is
3183 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
3184 vector memory order if they access LDS memory, and out of LDS operation order
3185 if they access global memory.
3186 * The vector memory operations access a single vector L1 cache shared by all
3187 SIMDs a CU. Therefore, no special action is required for coherence between the
3188 lanes of a single wavefront, or for coherence between wavefronts in the same
3189 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
3190 executing in different work-groups as they may be executing on different CUs.
3191 * The scalar memory operations access a scalar L1 cache shared by all wavefronts
3192 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
3193 scalar operations are used in a restricted way so do not impact the memory
3194 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
3195 * The vector and scalar memory operations use an L2 cache shared by all CUs on
3197 * The L2 cache has independent channels to service disjoint ranges of virtual
3199 * Each CU has a separate request queue per channel. Therefore, the vector and
3200 scalar memory operations performed by wavefronts executing in different work-groups
3201 (which may be executing on different CUs) of an agent can be reordered
3202 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
3203 synchronization between vector memory operations of different CUs. It ensures a
3204 previous vector memory operation has completed before executing a subsequent
3205 vector memory or LDS operation and so can be used to meet the requirements of
3206 acquire and release.
3207 * The L2 cache can be kept coherent with other agents on some targets, or ranges
3208 of virtual addresses can be set up to bypass it to ensure system coherence.
3212 * Each agent has multiple shader arrays (SA).
3213 * Each SA has multiple work-group processors (WGP).
3214 * Each WGP has multiple compute units (CU).
3215 * Each CU has multiple SIMDs that execute wavefronts.
3216 * The wavefronts for a single work-group are executed in the same
3217 WGP. In CU wavefront execution mode the wavefronts may be executed by
3218 different SIMDs in the same CU. In WGP wavefront execution mode the
3219 wavefronts may be executed by different SIMDs in different CUs in the same
3221 * Each WGP has a single LDS memory shared by the wavefronts of the work-groups
3223 * All LDS operations of a WGP are performed as wavefront wide operations in a
3224 global order and involve no caching. Completion is reported to a wavefront in
3226 * The LDS memory has multiple request queues shared by the SIMDs of a
3227 WGP. Therefore, the LDS operations performed by different wavefronts of a work-group
3228 can be reordered relative to each other, which can result in reordering the
3229 visibility of vector memory operations with respect to LDS operations of other
3230 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
3231 ensure synchronization between LDS operations and vector memory operations
3232 between wavefronts of a work-group, but not between operations performed by the
3234 * The vector memory operations are performed as wavefront wide operations.
3235 Completion of load/store/sample operations are reported to a wavefront in
3236 execution order of other load/store/sample operations performed by that
3238 * The vector memory operations access a vector L0 cache. There is a single L0
3239 cache per CU. Each SIMD of a CU accesses the same L0 cache.
3240 Therefore, no special action is required for coherence between the lanes of a
3241 single wavefront. However, a ``BUFFER_GL0_INV`` is required for coherence
3242 between wavefronts executing in the same work-group as they may be executing on
3243 SIMDs of different CUs that access different L0s. A ``BUFFER_GL0_INV`` is also
3244 required for coherence between wavefronts executing in different work-groups as
3245 they may be executing on different WGPs.
3246 * The scalar memory operations access a scalar L0 cache shared by all wavefronts
3247 on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
3248 operations are used in a restricted way so do not impact the memory model. See
3249 :ref:`amdgpu-amdhsa-memory-spaces`.
3250 * The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
3251 the same SA. Therefore, no special action is required for coherence between
3252 the wavefronts of a single work-group. However, a ``BUFFER_GL1_INV`` is
3253 required for coherence between wavefronts executing in different work-groups as
3254 they may be executing on different SAs that access different L1s.
3255 * The L1 caches have independent quadrants to service disjoint ranges of virtual
3257 * Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
3258 vector and scalar memory operations performed by different wavefronts, whether
3259 executing in the same or different work-groups (which may be executing on
3260 different CUs accessing different L0s), can be reordered relative to each
3261 other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure synchronization
3262 between vector memory operations of different wavefronts. It ensures a previous
3263 vector memory operation has completed before executing a subsequent vector
3264 memory or LDS operation and so can be used to meet the requirements of acquire,
3265 release and sequential consistency.
3266 * The L1 caches use an L2 cache shared by all SAs on the same agent.
3267 * The L2 cache has independent channels to service disjoint ranges of virtual
3269 * Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
3270 quadrant has a separate request queue per L2 channel. Therefore, the vector
3271 and scalar memory operations performed by wavefronts executing in different
3272 work-groups (which may be executing on different SAs) of an agent can be
3273 reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
3274 required to ensure synchronization between vector memory operations of
3275 different SAs. It ensures a previous vector memory operation has completed
3276 before executing a subsequent vector memory and so can be used to meet the
3277 requirements of acquire, release and sequential consistency.
3278 * The L2 cache can be kept coherent with other agents on some targets, or ranges
3279 of virtual addresses can be set up to bypass it to ensure system coherence.
3281 Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
3282 or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread is accessing the
3283 memory, atomic memory orderings are not meaningful and all accesses are treated
3286 Constant address space uses ``buffer/global_load`` instructions (or equivalent
3287 scalar memory instructions). Since the constant address space contents do not
3288 change during the execution of a kernel dispatch it is not legal to perform
3289 stores, and atomic memory orderings are not meaningful and all access are
3290 treated as non-atomic.
3292 A memory synchronization scope wider than work-group is not meaningful for the
3293 group (LDS) address space and is treated as work-group.
3295 The memory model does not support the region address space which is treated as
3298 Acquire memory ordering is not meaningful on store atomic instructions and is
3299 treated as non-atomic.
3301 Release memory ordering is not meaningful on load atomic instructions and is
3302 treated a non-atomic.
3304 Acquire-release memory ordering is not meaningful on load or store atomic
3305 instructions and is treated as acquire and release respectively.
3307 AMDGPU backend only uses scalar memory operations to access memory that is
3308 proven to not change during the execution of the kernel dispatch. This includes
3309 constant address space and global address space for program scope const
3310 variables. Therefore the kernel machine code does not have to maintain the
3311 scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
3312 and vector L1 caches are invalidated between kernel dispatches by CP since
3313 constant address space data may change between kernel dispatch executions. See
3314 :ref:`amdgpu-amdhsa-memory-spaces`.
3316 The one execption is if scalar writes are used to spill SGPR registers. In this
3317 case the AMDGPU backend ensures the memory location used to spill is never
3318 accessed by vector memory operations at the same time. If scalar writes are used
3319 then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
3320 return since the locations may be used for vector memory instructions by a
3321 future wavefront that uses the same scratch area, or a function call that creates a
3322 frame at the same address, respectively. There is no need for a ``s_dcache_inv``
3323 as all scalar writes are write-before-read in the same thread.
3325 For GFX6-GFX9, scratch backing memory (which is used for the private address space)
3326 is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
3327 address space is only accessed by a single thread, and is always
3328 write-before-read, there is never a need to invalidate these entries from the L1
3329 cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
3330 volatile cache lines.
3332 For GFX10, scratch backing memory (which is used for the private address space)
3333 is accessed with MTYPE NC (non-coherenent). Since the private address space is
3334 only accessed by a single thread, and is always write-before-read, there is
3335 never a need to invalidate these entries from the L0 or L1 caches.
3337 For GFX10, wavefronts are executed in native mode with in-order reporting of loads
3338 and sample instructions. In this mode vmcnt reports completion of load, atomic
3339 with return and sample instructions in order, and the vscnt reports the
3340 completion of store and atomic without return in order. See ``MEM_ORDERED`` field
3341 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
3343 In GFX10, wavefronts can be executed in WGP or CU wavefront execution mode:
3345 * In WGP wavefront execution mode the wavefronts of a work-group are executed
3346 on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
3347 CU L0 caches is required for work-group synchronization. Also accesses to L1 at
3348 work-group scope need to be expicitly ordered as the accesses from different
3349 CUs are not ordered.
3350 * In CU wavefront execution mode the wavefronts of a work-group are executed on
3351 the SIMDs of a single CU of the WGP. Therefore, all global memory access by
3352 the work-group access the same L0 which in turn ensures L1 accesses are
3353 ordered and so do not require explicit management of the caches for
3354 work-group synchronization.
3356 See ``WGP_MODE`` field in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`
3357 and :ref:`amdgpu-target-features`.
3359 On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
3360 to invalidate the L2 cache. For GFX6-GFX9, this also causes it to be treated as
3361 non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
3362 (cache coherent) and so the L2 cache will be coherent with the CPU and other
3365 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX10
3366 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table
3368 ============ ============ ============== ========== =============================== ==================================
3369 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code AMDGPU Machine Code
3370 Ordering Sync Scope Address GFX6-9 GFX10
3372 ============ ============ ============== ========== =============================== ==================================
3374 ----------------------------------------------------------------------------------------------------------------------
3375 load *none* *none* - global - !volatile & !nontemporal - !volatile & !nontemporal
3377 - private 1. buffer/global/flat_load 1. buffer/global/flat_load
3379 - volatile & !nontemporal - volatile & !nontemporal
3381 1. buffer/global/flat_load 1. buffer/global/flat_load
3384 - nontemporal - nontemporal
3386 1. buffer/global/flat_load 1. buffer/global/flat_load
3389 load *none* *none* - local 1. ds_load 1. ds_load
3390 store *none* *none* - global - !nontemporal - !nontemporal
3392 - private 1. buffer/global/flat_store 1. buffer/global/flat_store
3394 - nontemporal - nontemporal
3396 1. buffer/global/flat_stote 1. buffer/global/flat_store
3399 store *none* *none* - local 1. ds_store 1. ds_store
3400 **Unordered Atomic**
3401 ----------------------------------------------------------------------------------------------------------------------
3402 load atomic unordered *any* *any* *Same as non-atomic*. *Same as non-atomic*.
3403 store atomic unordered *any* *any* *Same as non-atomic*. *Same as non-atomic*.
3404 atomicrmw unordered *any* *any* *Same as monotonic *Same as monotonic
3406 **Monotonic Atomic**
3407 ----------------------------------------------------------------------------------------------------------------------
3408 load atomic monotonic - singlethread - global 1. buffer/global/flat_load 1. buffer/global/flat_load
3409 - wavefront - generic
3410 load atomic monotonic - workgroup - global 1. buffer/global/flat_load 1. buffer/global/flat_load
3413 - If CU wavefront execution mode, omit glc=1.
3415 load atomic monotonic - singlethread - local 1. ds_load 1. ds_load
3418 load atomic monotonic - agent - global 1. buffer/global/flat_load 1. buffer/global/flat_load
3419 - system - generic glc=1 glc=1 dlc=1
3420 store atomic monotonic - singlethread - global 1. buffer/global/flat_store 1. buffer/global/flat_store
3421 - wavefront - generic
3425 store atomic monotonic - singlethread - local 1. ds_store 1. ds_store
3428 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 1. buffer/global/flat_atomic
3429 - wavefront - generic
3433 atomicrmw monotonic - singlethread - local 1. ds_atomic 1. ds_atomic
3437 ----------------------------------------------------------------------------------------------------------------------
3438 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 1. buffer/global/ds/flat_load
3441 load atomic acquire - workgroup - global 1. buffer/global/flat_load 1. buffer/global_load glc=1
3443 - If CU wavefront execution mode, omit glc=1.
3445 2. s_waitcnt vmcnt(0)
3447 - If CU wavefront execution mode, omit.
3448 - Must happen before
3449 the following buffer_gl0_inv
3450 and before any following
3458 - If CU wavefront execution mode, omit.
3464 load atomic acquire - workgroup - local 1. ds_load 1. ds_load
3465 2. s_waitcnt lgkmcnt(0) 2. s_waitcnt lgkmcnt(0)
3467 - If OpenCL, omit. - If OpenCL, omit.
3468 - Must happen before - Must happen before
3469 any following the following buffer_gl0_inv
3470 global/generic and before any following
3471 load/load global/generic load/load
3472 atomic/store/store atomic/store/store
3473 atomic/atomicrmw. atomic/atomicrmw.
3474 - Ensures any - Ensures any
3475 following global following global
3476 data read is no data read is no
3477 older than the load older than the load
3478 atomic value being atomic value being
3483 - If CU wavefront execution mode, omit.
3490 load atomic acquire - workgroup - generic 1. flat_load 1. flat_load glc=1
3492 - If CU wavefront execution mode, omit glc=1.
3494 2. s_waitcnt lgkmcnt(0) 2. s_waitcnt lgkmcnt(0) &
3497 - If CU wavefront execution mode, omit vmcnt.
3498 - If OpenCL, omit. - If OpenCL, omit
3500 - Must happen before - Must happen before
3501 any following the following
3502 global/generic buffer_gl0_inv and any
3503 load/load following global/generic
3504 atomic/store/store load/load
3505 atomic/atomicrmw. atomic/store/store
3507 - Ensures any - Ensures any
3508 following global following global
3509 data read is no data read is no
3510 older than the load older than the load
3511 atomic value being atomic value being
3516 - If CU wavefront execution mode, omit.
3522 load atomic acquire - agent - global 1. buffer/global/flat_load 1. buffer/global_load
3523 - system glc=1 glc=1 dlc=1
3524 2. s_waitcnt vmcnt(0) 2. s_waitcnt vmcnt(0)
3526 - Must happen before - Must happen before
3528 buffer_wbinvl1_vol. buffer_gl*_inv.
3529 - Ensures the load - Ensures the load
3530 has completed has completed
3531 before invalidating before invalidating
3532 the cache. the caches.
3534 3. buffer_wbinvl1_vol 3. buffer_gl0_inv;
3537 - Must happen before - Must happen before
3538 any following any following
3539 global/generic global/generic
3541 atomic/atomicrmw. atomic/atomicrmw.
3542 - Ensures that - Ensures that
3544 loads will not see loads will not see
3545 stale global data. stale global data.
3547 load atomic acquire - agent - generic 1. flat_load glc=1 1. flat_load glc=1 dlc=1
3548 - system 2. s_waitcnt vmcnt(0) & 2. s_waitcnt vmcnt(0) &
3549 lgkmcnt(0) lgkmcnt(0)
3551 - If OpenCL omit - If OpenCL omit
3552 lgkmcnt(0). lgkmcnt(0).
3553 - Must happen before - Must happen before
3555 buffer_wbinvl1_vol. buffer_gl*_invl.
3556 - Ensures the flat_load - Ensures the flat_load
3557 has completed has completed
3558 before invalidating before invalidating
3559 the cache. the caches.
3561 3. buffer_wbinvl1_vol 3. buffer_gl0_inv;
3564 - Must happen before - Must happen before
3565 any following any following
3566 global/generic global/generic
3568 atomic/atomicrmw. atomic/atomicrmw.
3569 - Ensures that - Ensures that
3570 following loads following loads
3571 will not see stale will not see stale
3572 global data. global data.
3574 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic
3577 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic 1. buffer/global_atomic
3578 2. s_waitcnt vm/vscnt(0)
3580 - If CU wavefront execution mode, omit.
3581 - Use vmcnt if atomic with
3582 return and vscnt if atomic
3584 - Must happen before
3585 the following buffer_gl0_inv
3586 and before any following
3594 - If CU wavefront execution mode, omit.
3600 atomicrmw acquire - workgroup - local 1. ds_atomic 1. ds_atomic
3601 2. waitcnt lgkmcnt(0) 2. waitcnt lgkmcnt(0)
3603 - If OpenCL, omit. - If OpenCL, omit.
3604 - Must happen before - Must happen before
3605 any following the following
3606 global/generic buffer_gl0_inv.
3610 - Ensures any - Ensures any
3611 following global following global
3612 data read is no data read is no
3613 older than the older than the
3614 atomicrmw value atomicrmw value
3615 being acquired. being acquired.
3625 atomicrmw acquire - workgroup - generic 1. flat_atomic 1. flat_atomic
3626 2. waitcnt lgkmcnt(0) 2. waitcnt lgkmcnt(0) &
3629 - If CU wavefront execution mode, omit vm/vscnt.
3630 - If OpenCL, omit. - If OpenCL, omit
3631 waitcnt lgkmcnt(0)..
3632 - Use vmcnt if atomic with
3633 return and vscnt if atomic
3636 - Must happen before - Must happen before
3637 any following the following
3638 global/generic buffer_gl0_inv.
3642 - Ensures any - Ensures any
3643 following global following global
3644 data read is no data read is no
3645 older than the older than the
3646 atomicrmw value atomicrmw value
3647 being acquired. being acquired.
3651 - If CU wavefront execution mode, omit.
3657 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic 1. buffer/global_atomic
3658 - system 2. s_waitcnt vmcnt(0) 2. s_waitcnt vm/vscnt(0)
3660 - Use vmcnt if atomic with
3661 return and vscnt if atomic
3664 - Must happen before - Must happen before
3666 buffer_wbinvl1_vol. buffer_gl*_inv.
3667 - Ensures the - Ensures the
3668 atomicrmw has atomicrmw has
3669 completed before completed before
3670 invalidating the invalidating the
3673 3. buffer_wbinvl1_vol 3. buffer_gl0_inv;
3676 - Must happen before - Must happen before
3677 any following any following
3678 global/generic global/generic
3680 atomic/atomicrmw. atomic/atomicrmw.
3681 - Ensures that - Ensures that
3682 following loads following loads
3683 will not see stale will not see stale
3684 global data. global data.
3686 atomicrmw acquire - agent - generic 1. flat_atomic 1. flat_atomic
3687 - system 2. s_waitcnt vmcnt(0) & 2. s_waitcnt vm/vscnt(0) &
3688 lgkmcnt(0) lgkmcnt(0)
3690 - If OpenCL, omit - If OpenCL, omit
3691 lgkmcnt(0). lgkmcnt(0).
3692 - Use vmcnt if atomic with
3693 return and vscnt if atomic
3695 - Must happen before - Must happen before
3697 buffer_wbinvl1_vol. buffer_gl*_inv.
3698 - Ensures the - Ensures the
3699 atomicrmw has atomicrmw has
3700 completed before completed before
3701 invalidating the invalidating the
3704 3. buffer_wbinvl1_vol 3. buffer_gl0_inv;
3707 - Must happen before - Must happen before
3708 any following any following
3709 global/generic global/generic
3711 atomic/atomicrmw. atomic/atomicrmw.
3712 - Ensures that - Ensures that
3713 following loads following loads
3714 will not see stale will not see stale
3715 global data. global data.
3717 fence acquire - singlethread *none* *none* *none*
3719 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) &
3722 - If CU wavefront execution mode, omit vmcnt and
3724 - If OpenCL and - If OpenCL and
3725 address space is address space is
3726 not generic, omit. not generic, omit
3731 vmcnt(0) and vscnt(0).
3732 - However, since LLVM - However, since LLVM
3733 currently has no currently has no
3734 address space on address space on
3735 the fence need to the fence need to
3736 conservatively conservatively
3737 always generate. If always generate. If
3738 fence had an fence had an
3739 address space then address space then
3740 set to address set to address
3741 space of OpenCL space of OpenCL
3742 fence flag, or to fence flag, or to
3743 generic if both generic if both
3744 local and global local and global
3746 specified. specified.
3757 fence-paired-atomic).
3758 - Must happen before
3769 fence-paired-atomic.
3770 - Could be split into
3773 vscnt(0) and s_waitcnt
3779 - s_waitcnt vmcnt(0)
3784 atomicrmw-with-return-value
3791 fence-paired-atomic).
3792 - s_waitcnt vscnt(0)
3796 atomicrmw-no-return-value
3803 fence-paired-atomic).
3804 - s_waitcnt lgkmcnt(0)
3815 fence-paired-atomic).
3816 - Must happen before
3830 fence-paired-atomic.
3834 - If CU wavefront execution mode, omit.
3840 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) &
3841 - system vmcnt(0) vmcnt(0) & vscnt(0)
3843 - If OpenCL and - If OpenCL and
3844 address space is address space is
3845 not generic, omit not generic, omit
3846 lgkmcnt(0). lgkmcnt(0).
3850 vmcnt(0) and vscnt(0).
3851 - However, since LLVM - However, since LLVM
3852 currently has no currently has no
3853 address space on address space on
3854 the fence need to the fence need to
3855 conservatively conservatively
3856 always generate always generate
3857 (see comment for (see comment for
3858 previous fence). previous fence).
3859 - Could be split into
3868 - s_waitcnt vmcnt(0)
3879 fence-paired-atomic).
3880 - s_waitcnt lgkmcnt(0)
3891 fence-paired-atomic).
3892 - Must happen before
3906 fence-paired-atomic.
3907 - Could be split into
3910 vscnt(0) and s_waitcnt
3916 - s_waitcnt vmcnt(0)
3921 atomicrmw-with-return-value
3928 fence-paired-atomic).
3929 - s_waitcnt vscnt(0)
3933 atomicrmw-no-return-value
3940 fence-paired-atomic).
3941 - s_waitcnt lgkmcnt(0)
3952 fence-paired-atomic).
3953 - Must happen before
3967 fence-paired-atomic.
3969 2. buffer_wbinvl1_vol 2. buffer_gl0_inv;
3972 - Must happen before any - Must happen before any
3973 following global/generic following global/generic
3975 atomic/store/store atomic/store/store
3976 atomic/atomicrmw. atomic/atomicrmw.
3977 - Ensures that - Ensures that
3978 following loads following loads
3979 will not see stale will not see stale
3980 global data. global data.
3983 ----------------------------------------------------------------------------------------------------------------------
3984 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 1. buffer/global/ds/flat_store
3987 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) &
3990 - If CU wavefront execution mode, omit vmcnt and
3992 - If OpenCL, omit. - If OpenCL, omit
4000 - Could be split into
4003 vscnt(0) and s_waitcnt
4009 - s_waitcnt vmcnt(0)
4012 global/generic load/load
4014 atomicrmw-with-return-value.
4015 - s_waitcnt vscnt(0)
4021 atomicrmw-no-return-value.
4022 - s_waitcnt lgkmcnt(0)
4029 - Must happen before - Must happen before
4030 the following the following
4032 - Ensures that all - Ensures that all
4033 memory operations memory operations
4035 completed before completed before
4036 performing the performing the
4037 store that is being store that is being
4040 2. buffer/global/flat_store 2. buffer/global_store
4041 store atomic release - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0)
4043 - If CU wavefront execution mode, omit.
4045 - Could be split into
4047 vmcnt(0) and s_waitcnt
4053 - s_waitcnt vmcnt(0)
4056 global/generic load/load
4058 atomicrmw-with-return-value.
4059 - s_waitcnt vscnt(0)
4064 atomicrmw-no-return-value.
4065 - Must happen before
4076 1. ds_store 2. ds_store
4077 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) &
4080 - If CU wavefront execution mode, omit vmcnt and
4082 - If OpenCL, omit. - If OpenCL, omit
4090 - Could be split into
4093 vscnt(0) and s_waitcnt
4099 - s_waitcnt vmcnt(0)
4102 global/generic load/load
4104 atomicrmw-with-return-value.
4105 - s_waitcnt vscnt(0)
4111 atomicrmw-no-return-value.
4112 - s_waitcnt lgkmcnt(0)
4115 local/generic load/store/load
4116 atomic/store atomic/atomicrmw.
4117 - Must happen before - Must happen before
4118 the following the following
4120 - Ensures that all - Ensures that all
4121 memory operations memory operations
4123 completed before completed before
4124 performing the performing the
4125 store that is being store that is being
4128 2. flat_store 2. flat_store
4129 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) &
4130 - system - generic vmcnt(0) vmcnt(0) & vscnt(0)
4132 - If OpenCL, omit - If OpenCL, omit
4133 lgkmcnt(0). lgkmcnt(0).
4134 - Could be split into - Could be split into
4135 separate s_waitcnt separate s_waitcnt
4136 vmcnt(0) and vmcnt(0), s_waitcnt vscnt(0)
4137 s_waitcnt and s_waitcnt
4138 lgkmcnt(0) to allow lgkmcnt(0) to allow
4139 them to be them to be
4140 independently moved independently moved
4141 according to the according to the
4142 following rules. following rules.
4143 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0)
4144 must happen after must happen after
4145 any preceding any preceding
4146 global/generic global/generic
4147 load/store/load load/load
4148 atomic/store atomic/
4149 atomic/atomicrmw. atomicrmw-with-return-value.
4150 - s_waitcnt vscnt(0)
4155 atomicrmw-no-return-value.
4156 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0)
4157 must happen after must happen after
4158 any preceding any preceding
4159 local/generic local/generic
4160 load/store/load load/store/load
4161 atomic/store atomic/store
4162 atomic/atomicrmw. atomic/atomicrmw.
4163 - Must happen before - Must happen before
4164 the following the following
4166 - Ensures that all - Ensures that all
4167 memory operations memory operations
4168 to memory have to memory have
4169 completed before completed before
4170 performing the performing the
4171 store that is being store that is being
4174 2. buffer/global/ds/flat_store 2. buffer/global/ds/flat_store
4175 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic
4178 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) &
4181 - If CU wavefront execution mode, omit vmcnt and
4191 - Could be split into
4194 vscnt(0) and s_waitcnt
4200 - s_waitcnt vmcnt(0)
4203 global/generic load/load
4205 atomicrmw-with-return-value.
4206 - s_waitcnt vscnt(0)
4212 atomicrmw-no-return-value.
4213 - s_waitcnt lgkmcnt(0)
4220 - Must happen before - Must happen before
4221 the following the following
4222 atomicrmw. atomicrmw.
4223 - Ensures that all - Ensures that all
4224 memory operations memory operations
4226 completed before completed before
4227 performing the performing the
4228 atomicrmw that is atomicrmw that is
4229 being released. being released.
4231 2. buffer/global/flat_atomic 2. buffer/global_atomic
4232 atomicrmw release - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0)
4234 - If CU wavefront execution mode, omit.
4236 - Could be split into
4238 vmcnt(0) and s_waitcnt
4244 - s_waitcnt vmcnt(0)
4247 global/generic load/load
4249 atomicrmw-with-return-value.
4250 - s_waitcnt vscnt(0)
4255 atomicrmw-no-return-value.
4256 - Must happen before
4267 1. ds_atomic 2. ds_atomic
4268 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) &
4271 - If CU wavefront execution mode, omit vmcnt and
4273 - If OpenCL, omit. - If OpenCL, omit
4281 - Could be split into
4284 vscnt(0) and s_waitcnt
4290 - s_waitcnt vmcnt(0)
4293 global/generic load/load
4295 atomicrmw-with-return-value.
4296 - s_waitcnt vscnt(0)
4302 atomicrmw-no-return-value.
4303 - s_waitcnt lgkmcnt(0)
4306 local/generic load/store/load
4307 atomic/store atomic/atomicrmw.
4308 - Must happen before - Must happen before
4309 the following the following
4310 atomicrmw. atomicrmw.
4311 - Ensures that all - Ensures that all
4312 memory operations memory operations
4314 completed before completed before
4315 performing the performing the
4316 atomicrmw that is atomicrmw that is
4317 being released. being released.
4319 2. flat_atomic 2. flat_atomic
4320 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lkkmcnt(0) &
4321 - system - generic vmcnt(0) vmcnt(0) & vscnt(0)
4323 - If OpenCL, omit - If OpenCL, omit
4324 lgkmcnt(0). lgkmcnt(0).
4325 - Could be split into - Could be split into
4326 separate s_waitcnt separate s_waitcnt
4327 vmcnt(0) and vmcnt(0), s_waitcnt
4328 s_waitcnt vscnt(0) and s_waitcnt
4329 lgkmcnt(0) to allow lgkmcnt(0) to allow
4330 them to be them to be
4331 independently moved independently moved
4332 according to the according to the
4333 following rules. following rules.
4334 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0)
4335 must happen after must happen after
4336 any preceding any preceding
4337 global/generic global/generic
4338 load/store/load load/load atomic/
4339 atomic/store atomicrmw-with-return-value.
4341 - s_waitcnt vscnt(0)
4346 atomicrmw-no-return-value.
4347 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0)
4348 must happen after must happen after
4349 any preceding any preceding
4350 local/generic local/generic
4351 load/store/load load/store/load
4352 atomic/store atomic/store
4353 atomic/atomicrmw. atomic/atomicrmw.
4354 - Must happen before - Must happen before
4355 the following the following
4356 atomicrmw. atomicrmw.
4357 - Ensures that all - Ensures that all
4358 memory operations memory operations
4359 to global and local to global and local
4360 have completed have completed
4361 before performing before performing
4362 the atomicrmw that the atomicrmw that
4363 is being released. is being released.
4365 2. buffer/global/ds/flat_atomic 2. buffer/global/ds/flat_atomic
4366 fence release - singlethread *none* *none* *none*
4368 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) &
4371 - If CU wavefront execution mode, omit vmcnt and
4373 - If OpenCL and - If OpenCL and
4374 address space is address space is
4375 not generic, omit. not generic, omit
4380 vmcnt(0) and vscnt(0).
4381 - However, since LLVM - However, since LLVM
4382 currently has no currently has no
4383 address space on address space on
4384 the fence need to the fence need to
4385 conservatively conservatively
4386 always generate. If always generate. If
4387 fence had an fence had an
4388 address space then address space then
4389 set to address set to address
4390 space of OpenCL space of OpenCL
4391 fence flag, or to fence flag, or to
4392 generic if both generic if both
4393 local and global local and global
4395 specified. specified.
4402 - Could be split into
4405 vscnt(0) and s_waitcnt
4411 - s_waitcnt vmcnt(0)
4417 atomicrmw-with-return-value.
4418 - s_waitcnt vscnt(0)
4423 atomicrmw-no-return-value.
4424 - s_waitcnt lgkmcnt(0)
4429 atomic/store atomic/
4431 - Must happen before - Must happen before
4432 any following store any following store
4433 atomic/atomicrmw atomic/atomicrmw
4434 with an equal or with an equal or
4435 wider sync scope wider sync scope
4436 and memory ordering and memory ordering
4437 stronger than stronger than
4438 unordered (this is unordered (this is
4439 termed the termed the
4440 fence-paired-atomic). fence-paired-atomic).
4441 - Ensures that all - Ensures that all
4442 memory operations memory operations
4444 completed before completed before
4445 performing the performing the
4447 fence-paired-atomic. fence-paired-atomic.
4449 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) &
4450 - system vmcnt(0) vmcnt(0) & vscnt(0)
4452 - If OpenCL and - If OpenCL and
4453 address space is address space is
4454 not generic, omit not generic, omit
4455 lgkmcnt(0). lgkmcnt(0).
4456 - If OpenCL and - If OpenCL and
4457 address space is address space is
4458 local, omit local, omit
4459 vmcnt(0). vmcnt(0) and vscnt(0).
4460 - However, since LLVM - However, since LLVM
4461 currently has no currently has no
4462 address space on address space on
4463 the fence need to the fence need to
4464 conservatively conservatively
4465 always generate. If always generate. If
4466 fence had an fence had an
4467 address space then address space then
4468 set to address set to address
4469 space of OpenCL space of OpenCL
4470 fence flag, or to fence flag, or to
4471 generic if both generic if both
4472 local and global local and global
4474 specified. specified.
4475 - Could be split into - Could be split into
4476 separate s_waitcnt separate s_waitcnt
4477 vmcnt(0) and vmcnt(0), s_waitcnt
4478 s_waitcnt vscnt(0) and s_waitcnt
4479 lgkmcnt(0) to allow lgkmcnt(0) to allow
4480 them to be them to be
4481 independently moved independently moved
4482 according to the according to the
4483 following rules. following rules.
4484 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0)
4485 must happen after must happen after
4486 any preceding any preceding
4487 global/generic global/generic
4488 load/store/load load/load atomic/
4489 atomic/store atomicrmw-with-return-value.
4491 - s_waitcnt vscnt(0)
4496 atomicrmw-no-return-value.
4497 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0)
4498 must happen after must happen after
4499 any preceding any preceding
4500 local/generic local/generic
4501 load/store/load load/store/load
4502 atomic/store atomic/store
4503 atomic/atomicrmw. atomic/atomicrmw.
4504 - Must happen before - Must happen before
4505 any following store any following store
4506 atomic/atomicrmw atomic/atomicrmw
4507 with an equal or with an equal or
4508 wider sync scope wider sync scope
4509 and memory ordering and memory ordering
4510 stronger than stronger than
4511 unordered (this is unordered (this is
4512 termed the termed the
4513 fence-paired-atomic). fence-paired-atomic).
4514 - Ensures that all - Ensures that all
4515 memory operations memory operations
4517 completed before completed before
4518 performing the performing the
4520 fence-paired-atomic. fence-paired-atomic.
4522 **Acquire-Release Atomic**
4523 ----------------------------------------------------------------------------------------------------------------------
4524 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic
4527 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) &
4530 - If CU wavefront execution mode, omit vmcnt and
4532 - If OpenCL, omit. - If OpenCL, omit
4533 s_waitcnt lgkmcnt(0).
4534 - Must happen after - Must happen after
4535 any preceding any preceding
4536 local/generic local/generic
4537 load/store/load load/store/load
4538 atomic/store atomic/store
4539 atomic/atomicrmw. atomic/atomicrmw.
4540 - Could be split into
4543 vscnt(0) and s_waitcnt
4549 - s_waitcnt vmcnt(0)
4552 global/generic load/load
4554 atomicrmw-with-return-value.
4555 - s_waitcnt vscnt(0)
4561 atomicrmw-no-return-value.
4562 - s_waitcnt lgkmcnt(0)
4565 local/generic load/store/load
4566 atomic/store atomic/atomicrmw.
4567 - Must happen before - Must happen before
4568 the following the following
4569 atomicrmw. atomicrmw.
4570 - Ensures that all - Ensures that all
4571 memory operations memory operations
4573 completed before completed before
4574 performing the performing the
4575 atomicrmw that is atomicrmw that is
4576 being released. being released.
4578 2. buffer/global/flat_atomic 2. buffer/global_atomic
4579 3. s_waitcnt vm/vscnt(0)
4581 - If CU wavefront execution mode, omit vm/vscnt.
4582 - Use vmcnt if atomic with
4583 return and vscnt if atomic
4586 - Must happen before
4598 - If CU wavefront execution mode, omit.
4604 atomicrmw acq_rel - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0)
4606 - If CU wavefront execution mode, omit.
4608 - Could be split into
4610 vmcnt(0) and s_waitcnt
4616 - s_waitcnt vmcnt(0)
4619 global/generic load/load
4621 atomicrmw-with-return-value.
4622 - s_waitcnt vscnt(0)
4627 atomicrmw-no-return-value.
4628 - Must happen before
4639 1. ds_atomic 2. ds_atomic
4640 2. s_waitcnt lgkmcnt(0) 3. s_waitcnt lgkmcnt(0)
4642 - If OpenCL, omit. - If OpenCL, omit.
4643 - Must happen before - Must happen before
4644 any following the following
4645 global/generic buffer_gl0_inv.
4649 - Ensures any - Ensures any
4650 following global following global
4651 data read is no data read is no
4652 older than the load older than the load
4653 atomic value being atomic value being
4658 - If CU wavefront execution mode, omit.
4665 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) &
4668 - If CU wavefront execution mode, omit vmcnt and
4670 - If OpenCL, omit. - If OpenCL, omit
4678 - Could be split into
4681 vscnt(0) and s_waitcnt
4687 - s_waitcnt vmcnt(0)
4690 global/generic load/load
4692 atomicrmw-with-return-value.
4693 - s_waitcnt vscnt(0)
4699 atomicrmw-no-return-value.
4700 - s_waitcnt lgkmcnt(0)
4703 local/generic load/store/load
4704 atomic/store atomic/atomicrmw.
4705 - Must happen before - Must happen before
4706 the following the following
4707 atomicrmw. atomicrmw.
4708 - Ensures that all - Ensures that all
4709 memory operations memory operations
4711 completed before completed before
4712 performing the performing the
4713 atomicrmw that is atomicrmw that is
4714 being released. being released.
4716 2. flat_atomic 2. flat_atomic
4717 3. s_waitcnt lgkmcnt(0) 3. s_waitcnt lgkmcnt(0) &
4720 - If CU wavefront execution mode, omit vm/vscnt.
4721 - If OpenCL, omit. - If OpenCL, omit
4723 - Must happen before - Must happen before
4724 any following the following
4725 global/generic buffer_gl0_inv.
4729 - Ensures any - Ensures any
4730 following global following global
4731 data read is no data read is no
4732 older than the load older than the load
4733 atomic value being atomic value being
4738 - If CU wavefront execution mode, omit.
4744 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) &
4745 - system vmcnt(0) vmcnt(0) & vscnt(0)
4747 - If OpenCL, omit - If OpenCL, omit
4748 lgkmcnt(0). lgkmcnt(0).
4749 - Could be split into - Could be split into
4750 separate s_waitcnt separate s_waitcnt
4751 vmcnt(0) and vmcnt(0), s_waitcnt
4752 s_waitcnt vscnt(0) and s_waitcnt
4753 lgkmcnt(0) to allow lgkmcnt(0) to allow
4754 them to be them to be
4755 independently moved independently moved
4756 according to the according to the
4757 following rules. following rules.
4758 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0)
4759 must happen after must happen after
4760 any preceding any preceding
4761 global/generic global/generic
4762 load/store/load load/load atomic/
4763 atomic/store atomicrmw-with-return-value.
4765 - s_waitcnt vscnt(0)
4770 atomicrmw-no-return-value.
4771 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0)
4772 must happen after must happen after
4773 any preceding any preceding
4774 local/generic local/generic
4775 load/store/load load/store/load
4776 atomic/store atomic/store
4777 atomic/atomicrmw. atomic/atomicrmw.
4778 - Must happen before - Must happen before
4779 the following the following
4780 atomicrmw. atomicrmw.
4781 - Ensures that all - Ensures that all
4782 memory operations memory operations
4783 to global have to global have
4784 completed before completed before
4785 performing the performing the
4786 atomicrmw that is atomicrmw that is
4787 being released. being released.
4789 2. buffer/global/flat_atomic 2. buffer/global_atomic
4790 3. s_waitcnt vmcnt(0) 3. s_waitcnt vm/vscnt(0)
4792 - Use vmcnt if atomic with
4793 return and vscnt if atomic
4796 - Must happen before - Must happen before
4798 buffer_wbinvl1_vol. buffer_gl*_inv.
4799 - Ensures the - Ensures the
4800 atomicrmw has atomicrmw has
4801 completed before completed before
4802 invalidating the invalidating the
4805 4. buffer_wbinvl1_vol 4. buffer_gl0_inv;
4808 - Must happen before - Must happen before
4809 any following any following
4810 global/generic global/generic
4812 atomic/atomicrmw. atomic/atomicrmw.
4813 - Ensures that - Ensures that
4814 following loads following loads
4815 will not see stale will not see stale
4816 global data. global data.
4818 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) &
4819 - system vmcnt(0) vmcnt(0) & vscnt(0)
4821 - If OpenCL, omit - If OpenCL, omit
4822 lgkmcnt(0). lgkmcnt(0).
4823 - Could be split into - Could be split into
4824 separate s_waitcnt separate s_waitcnt
4825 vmcnt(0) and vmcnt(0), s_waitcnt
4826 s_waitcnt vscnt(0) and s_waitcnt
4827 lgkmcnt(0) to allow lgkmcnt(0) to allow
4828 them to be them to be
4829 independently moved independently moved
4830 according to the according to the
4831 following rules. following rules.
4832 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0)
4833 must happen after must happen after
4834 any preceding any preceding
4835 global/generic global/generic
4836 load/store/load load/load atomic
4837 atomic/store atomicrmw-with-return-value.
4839 - s_waitcnt vscnt(0)
4844 atomicrmw-no-return-value.
4845 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0)
4846 must happen after must happen after
4847 any preceding any preceding
4848 local/generic local/generic
4849 load/store/load load/store/load
4850 atomic/store atomic/store
4851 atomic/atomicrmw. atomic/atomicrmw.
4852 - Must happen before - Must happen before
4853 the following the following
4854 atomicrmw. atomicrmw.
4855 - Ensures that all - Ensures that all
4856 memory operations memory operations
4858 completed before completed before
4859 performing the performing the
4860 atomicrmw that is atomicrmw that is
4861 being released. being released.
4863 2. flat_atomic 2. flat_atomic
4864 3. s_waitcnt vmcnt(0) & 3. s_waitcnt vm/vscnt(0) &
4865 lgkmcnt(0) lgkmcnt(0)
4867 - If OpenCL, omit - If OpenCL, omit
4868 lgkmcnt(0). lgkmcnt(0).
4869 - Use vmcnt if atomic with
4870 return and vscnt if atomic
4872 - Must happen before - Must happen before
4874 buffer_wbinvl1_vol. buffer_gl*_inv.
4875 - Ensures the - Ensures the
4876 atomicrmw has atomicrmw has
4877 completed before completed before
4878 invalidating the invalidating the
4881 4. buffer_wbinvl1_vol 4. buffer_gl0_inv;
4884 - Must happen before - Must happen before
4885 any following any following
4886 global/generic global/generic
4888 atomic/atomicrmw. atomic/atomicrmw.
4889 - Ensures that - Ensures that
4890 following loads following loads
4891 will not see stale will not see stale
4892 global data. global data.
4894 fence acq_rel - singlethread *none* *none* *none*
4896 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) &
4899 - If CU wavefront execution mode, omit vmcnt and
4901 - If OpenCL and - If OpenCL and
4902 address space is address space is
4903 not generic, omit. not generic, omit
4908 vmcnt(0) and vscnt(0).
4909 - However, - However,
4910 since LLVM since LLVM
4911 currently has no currently has no
4912 address space on address space on
4913 the fence need to the fence need to
4914 conservatively conservatively
4915 always generate always generate
4916 (see comment for (see comment for
4917 previous fence). previous fence).
4924 - Could be split into
4927 vscnt(0) and s_waitcnt
4933 - s_waitcnt vmcnt(0)
4939 atomicrmw-with-return-value.
4940 - s_waitcnt vscnt(0)
4945 atomicrmw-no-return-value.
4946 - s_waitcnt lgkmcnt(0)
4951 atomic/store atomic/
4953 - Must happen before - Must happen before
4954 any following any following
4955 global/generic global/generic
4957 atomic/store/store atomic/store/store
4958 atomic/atomicrmw. atomic/atomicrmw.
4959 - Ensures that all - Ensures that all
4960 memory operations memory operations
4962 completed before completed before
4963 performing any performing any
4964 following global following global
4965 memory operations. memory operations.
4966 - Ensures that the - Ensures that the
4968 local/generic load local/generic load
4969 atomic/atomicrmw atomic/atomicrmw
4970 with an equal or with an equal or
4971 wider sync scope wider sync scope
4972 and memory ordering and memory ordering
4973 stronger than stronger than
4974 unordered (this is unordered (this is
4975 termed the termed the
4976 acquire-fence-paired-atomic acquire-fence-paired-atomic
4977 ) has completed ) has completed
4978 before following before following
4979 global memory global memory
4980 operations. This operations. This
4981 satisfies the satisfies the
4982 requirements of requirements of
4984 - Ensures that all - Ensures that all
4985 previous memory previous memory
4986 operations have operations have
4987 completed before a completed before a
4989 local/generic store local/generic store
4990 atomic/atomicrmw atomic/atomicrmw
4991 with an equal or with an equal or
4992 wider sync scope wider sync scope
4993 and memory ordering and memory ordering
4994 stronger than stronger than
4995 unordered (this is unordered (this is
4996 termed the termed the
4997 release-fence-paired-atomic release-fence-paired-atomic
4998 ). This satisfies the ). This satisfies the
4999 requirements of requirements of
5001 - Must happen before
5005 acquire-fence-paired
5006 atomic has completed
5015 acquire-fence-paired-atomic.
5019 - If CU wavefront execution mode, omit.
5025 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) &
5026 - system vmcnt(0) vmcnt(0) & vscnt(0)
5028 - If OpenCL and - If OpenCL and
5029 address space is address space is
5030 not generic, omit not generic, omit
5031 lgkmcnt(0). lgkmcnt(0).
5035 vmcnt(0) and vscnt(0).
5036 - However, since LLVM - However, since LLVM
5037 currently has no currently has no
5038 address space on address space on
5039 the fence need to the fence need to
5040 conservatively conservatively
5041 always generate always generate
5042 (see comment for (see comment for
5043 previous fence). previous fence).
5044 - Could be split into - Could be split into
5045 separate s_waitcnt separate s_waitcnt
5046 vmcnt(0) and vmcnt(0), s_waitcnt
5047 s_waitcnt vscnt(0) and s_waitcnt
5048 lgkmcnt(0) to allow lgkmcnt(0) to allow
5049 them to be them to be
5050 independently moved independently moved
5051 according to the according to the
5052 following rules. following rules.
5053 - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0)
5054 must happen after must happen after
5055 any preceding any preceding
5056 global/generic global/generic
5057 load/store/load load/load
5058 atomic/store atomic/
5059 atomic/atomicrmw. atomicrmw-with-return-value.
5060 - s_waitcnt vscnt(0)
5065 atomicrmw-no-return-value.
5066 - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0)
5067 must happen after must happen after
5068 any preceding any preceding
5069 local/generic local/generic
5070 load/store/load load/store/load
5071 atomic/store atomic/store
5072 atomic/atomicrmw. atomic/atomicrmw.
5073 - Must happen before - Must happen before
5074 the following the following
5075 buffer_wbinvl1_vol. buffer_gl*_inv.
5076 - Ensures that the - Ensures that the
5078 global/local/generic global/local/generic
5080 atomic/atomicrmw atomic/atomicrmw
5081 with an equal or with an equal or
5082 wider sync scope wider sync scope
5083 and memory ordering and memory ordering
5084 stronger than stronger than
5085 unordered (this is unordered (this is
5086 termed the termed the
5087 acquire-fence-paired-atomic acquire-fence-paired-atomic
5088 ) has completed ) has completed
5089 before invalidating before invalidating
5090 the cache. This the caches. This
5091 satisfies the satisfies the
5092 requirements of requirements of
5094 - Ensures that all - Ensures that all
5095 previous memory previous memory
5096 operations have operations have
5097 completed before a completed before a
5099 global/local/generic global/local/generic
5101 atomic/atomicrmw atomic/atomicrmw
5102 with an equal or with an equal or
5103 wider sync scope wider sync scope
5104 and memory ordering and memory ordering
5105 stronger than stronger than
5106 unordered (this is unordered (this is
5107 termed the termed the
5108 release-fence-paired-atomic release-fence-paired-atomic
5109 ). This satisfies the ). This satisfies the
5110 requirements of requirements of
5113 2. buffer_wbinvl1_vol 2. buffer_gl0_inv;
5116 - Must happen before - Must happen before
5117 any following any following
5118 global/generic global/generic
5120 atomic/store/store atomic/store/store
5121 atomic/atomicrmw. atomic/atomicrmw.
5122 - Ensures that - Ensures that
5123 following loads following loads
5124 will not see stale will not see stale
5125 global data. This global data. This
5126 satisfies the satisfies the
5127 requirements of requirements of
5130 **Sequential Consistent Atomic**
5131 ----------------------------------------------------------------------------------------------------------------------
5132 load atomic seq_cst - singlethread - global *Same as corresponding *Same as corresponding
5133 - wavefront - local load atomic acquire, load atomic acquire,
5134 - generic except must generated except must generated
5135 all instructions even all instructions even
5136 for OpenCL.* for OpenCL.*
5137 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) &
5138 - generic vmcnt(0) & vscnt(0)
5140 - If CU wavefront execution mode, omit vmcnt and
5142 - Could be split into
5145 vscnt(0) and s_waitcnt
5151 - Must - waitcnt lgkmcnt(0) must
5152 happen after happen after
5154 global/generic load local load
5155 atomic/store atomic/store
5156 atomic/atomicrmw atomic/atomicrmw
5157 with memory with memory
5158 ordering of seq_cst ordering of seq_cst
5159 and with equal or and with equal or
5160 wider sync scope. wider sync scope.
5161 (Note that seq_cst (Note that seq_cst
5162 fences have their fences have their
5163 own s_waitcnt own s_waitcnt
5164 lgkmcnt(0) and so do lgkmcnt(0) and so do
5165 not need to be not need to be
5166 considered.) considered.)
5172 atomicrmw-with-return-value
5186 global/generic store
5188 atomicrmw-no-return-value
5199 - Ensures any - Ensures any
5201 sequential sequential
5202 consistent local consistent global/local
5203 memory instructions memory instructions
5204 have completed have completed
5205 before executing before executing
5206 this sequentially this sequentially
5207 consistent consistent
5208 instruction. This instruction. This
5209 prevents reordering prevents reordering
5210 a seq_cst store a seq_cst store
5211 followed by a followed by a
5212 seq_cst load. (Note seq_cst load. (Note
5213 that seq_cst is that seq_cst is
5214 stronger than stronger than
5215 acquire/release as acquire/release as
5216 the reordering of the reordering of
5217 load acquire load acquire
5218 followed by a store followed by a store
5219 release is release is
5220 prevented by the prevented by the
5221 waitcnt of waitcnt of
5222 the release, but the release, but
5223 there is nothing there is nothing
5224 preventing a store preventing a store
5225 release followed by release followed by
5226 load acquire from load acquire from
5227 competing out of competing out of
5230 2. *Following 2. *Following
5231 instructions same as instructions same as
5232 corresponding load corresponding load
5233 atomic acquire, atomic acquire,
5234 except must generated except must generated
5235 all instructions even all instructions even
5236 for OpenCL.* for OpenCL.*
5237 load atomic seq_cst - workgroup - local *Same as corresponding
5238 load atomic acquire,
5239 except must generated
5240 all instructions even
5243 1. s_waitcnt vmcnt(0) & vscnt(0)
5245 - If CU wavefront execution mode, omit.
5246 - Could be split into
5248 vmcnt(0) and s_waitcnt
5259 atomicrmw-with-return-value
5273 global/generic store
5275 atomicrmw-no-return-value
5318 instructions same as
5321 except must generated
5322 all instructions even
5325 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) &
5326 - system - generic vmcnt(0) vmcnt(0) & vscnt(0)
5328 - Could be split into - Could be split into
5329 separate s_waitcnt separate s_waitcnt
5330 vmcnt(0) vmcnt(0), s_waitcnt
5331 and s_waitcnt vscnt(0) and s_waitcnt
5332 lgkmcnt(0) to allow lgkmcnt(0) to allow
5333 them to be them to be
5334 independently moved independently moved
5335 according to the according to the
5336 following rules. following rules.
5337 - waitcnt lgkmcnt(0) - waitcnt lgkmcnt(0)
5338 must happen after must happen after
5340 global/generic load local load
5341 atomic/store atomic/store
5342 atomic/atomicrmw atomic/atomicrmw
5343 with memory with memory
5344 ordering of seq_cst ordering of seq_cst
5345 and with equal or and with equal or
5346 wider sync scope. wider sync scope.
5347 (Note that seq_cst (Note that seq_cst
5348 fences have their fences have their
5349 own s_waitcnt own s_waitcnt
5350 lgkmcnt(0) and so do lgkmcnt(0) and so do
5351 not need to be not need to be
5352 considered.) considered.)
5353 - waitcnt vmcnt(0) - waitcnt vmcnt(0)
5354 must happen after must happen after
5356 global/generic load global/generic load
5357 atomic/store atomic/
5358 atomic/atomicrmw atomicrmw-with-return-value
5359 with memory with memory
5360 ordering of seq_cst ordering of seq_cst
5361 and with equal or and with equal or
5362 wider sync scope. wider sync scope.
5363 (Note that seq_cst (Note that seq_cst
5364 fences have their fences have their
5365 own s_waitcnt own s_waitcnt
5366 vmcnt(0) and so do vmcnt(0) and so do
5367 not need to be not need to be
5368 considered.) considered.)
5372 global/generic store
5374 atomicrmw-no-return-value
5385 - Ensures any - Ensures any
5387 sequential sequential
5388 consistent global consistent global
5389 memory instructions memory instructions
5390 have completed have completed
5391 before executing before executing
5392 this sequentially this sequentially
5393 consistent consistent
5394 instruction. This instruction. This
5395 prevents reordering prevents reordering
5396 a seq_cst store a seq_cst store
5397 followed by a followed by a
5398 seq_cst load. (Note seq_cst load. (Note
5399 that seq_cst is that seq_cst is
5400 stronger than stronger than
5401 acquire/release as acquire/release as
5402 the reordering of the reordering of
5403 load acquire load acquire
5404 followed by a store followed by a store
5405 release is release is
5406 prevented by the prevented by the
5407 waitcnt of waitcnt of
5408 the release, but the release, but
5409 there is nothing there is nothing
5410 preventing a store preventing a store
5411 release followed by release followed by
5412 load acquire from load acquire from
5413 competing out of competing out of
5416 2. *Following 2. *Following
5417 instructions same as instructions same as
5418 corresponding load corresponding load
5419 atomic acquire, atomic acquire,
5420 except must generated except must generated
5421 all instructions even all instructions even
5422 for OpenCL.* for OpenCL.*
5423 store atomic seq_cst - singlethread - global *Same as corresponding *Same as corresponding
5424 - wavefront - local store atomic release, store atomic release,
5425 - workgroup - generic except must generated except must generated
5426 all instructions even all instructions even
5427 for OpenCL.* for OpenCL.*
5428 store atomic seq_cst - agent - global *Same as corresponding *Same as corresponding
5429 - system - generic store atomic release, store atomic release,
5430 except must generated except must generated
5431 all instructions even all instructions even
5432 for OpenCL.* for OpenCL.*
5433 atomicrmw seq_cst - singlethread - global *Same as corresponding *Same as corresponding
5434 - wavefront - local atomicrmw acq_rel, atomicrmw acq_rel,
5435 - workgroup - generic except must generated except must generated
5436 all instructions even all instructions even
5437 for OpenCL.* for OpenCL.*
5438 atomicrmw seq_cst - agent - global *Same as corresponding *Same as corresponding
5439 - system - generic atomicrmw acq_rel, atomicrmw acq_rel,
5440 except must generated except must generated
5441 all instructions even all instructions even
5442 for OpenCL.* for OpenCL.*
5443 fence seq_cst - singlethread *none* *Same as corresponding *Same as corresponding
5444 - wavefront fence acq_rel, fence acq_rel,
5445 - workgroup except must generated except must generated
5446 - agent all instructions even all instructions even
5447 - system for OpenCL.* for OpenCL.*
5448 ============ ============ ============== ========== =============================== ==================================
5450 The memory order also adds the single thread optimization constrains defined in
5452 :ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table`.
5454 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX10
5455 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table
5457 ============ ==============================================================
5458 LLVM Memory Optimization Constraints
5460 ============ ==============================================================
5463 acquire - If a load atomic/atomicrmw then no following load/load
5464 atomic/store/ store atomic/atomicrmw/fence instruction can
5465 be moved before the acquire.
5466 - If a fence then same as load atomic, plus no preceding
5467 associated fence-paired-atomic can be moved after the fence.
5468 release - If a store atomic/atomicrmw then no preceding load/load
5469 atomic/store/ store atomic/atomicrmw/fence instruction can
5470 be moved after the release.
5471 - If a fence then same as store atomic, plus no following
5472 associated fence-paired-atomic can be moved before the
5474 acq_rel Same constraints as both acquire and release.
5475 seq_cst - If a load atomic then same constraints as acquire, plus no
5476 preceding sequentially consistent load atomic/store
5477 atomic/atomicrmw/fence instruction can be moved after the
5479 - If a store atomic then the same constraints as release, plus
5480 no following sequentially consistent load atomic/store
5481 atomic/atomicrmw/fence instruction can be moved before the
5483 - If an atomicrmw/fence then same constraints as acq_rel.
5484 ============ ==============================================================
5489 For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
5490 (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
5491 the ``s_trap`` instruction with the following usage:
5493 .. table:: AMDGPU Trap Handler for AMDHSA OS
5494 :name: amdgpu-trap-handler-for-amdhsa-os-table
5496 =================== =============== =============== =======================
5497 Usage Code Sequence Trap Handler Description
5499 =================== =============== =============== =======================
5500 reserved ``s_trap 0x00`` Reserved by hardware.
5501 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
5502 ``queue_ptr`` ``debugtrap``
5503 ``VGPR0``: intrinsic (not
5504 ``arg`` implemented).
5505 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
5506 ``queue_ptr`` terminated and its
5507 associated queue put
5508 into the error state.
5509 ``llvm.debugtrap`` ``s_trap 0x03`` - If debugger not
5519 - If the debugger is
5521 the debug trap to be
5525 the halt state until
5528 reserved ``s_trap 0x04`` Reserved.
5529 reserved ``s_trap 0x05`` Reserved.
5530 reserved ``s_trap 0x06`` Reserved.
5531 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
5533 reserved ``s_trap 0x08`` Reserved.
5534 reserved ``s_trap 0xfe`` Reserved.
5535 reserved ``s_trap 0xff`` Reserved.
5536 =================== =============== =============== =======================
5541 This section provides code conventions used when the target triple OS is
5542 ``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
5543 from the application/runtime to each invocation of a hardware shader. These
5544 parameters include both generic, application-controlled parameters called
5545 *user data* as well as system-generated parameters that are a product of the
5546 draw or dispatch execution.
5551 Each hardware stage has a set of 32-bit *user data registers* which can be
5552 written from a command buffer and then loaded into SGPRs when waves are launched
5553 via a subsequent dispatch or draw operation. This is the way most arguments are
5554 passed from the application/runtime to a hardware shader.
5559 Compute shader user data mappings are simpler than graphics shaders, and have a
5562 Note that there are always 10 available *user data entries* in registers -
5563 entries beyond that limit must be fetched from memory (via the spill table
5564 pointer) by the shader.
5566 .. table:: PAL Compute Shader User Data Registers
5567 :name: pal-compute-user-data-registers
5569 ============= ================================
5570 User Register Description
5571 ============= ================================
5572 0 Global Internal Table (32-bit pointer)
5573 1 Per-Shader Internal Table (32-bit pointer)
5574 2 - 11 Application-Controlled User Data (10 32-bit values)
5575 12 Spill Table (32-bit pointer)
5576 13 - 14 Thread Group Count (64-bit pointer)
5578 ============= ================================
5583 Graphics pipelines support a much more flexible user data mapping:
5585 .. table:: PAL Graphics Shader User Data Registers
5586 :name: pal-graphics-user-data-registers
5588 ============= ================================
5589 User Register Description
5590 ============= ================================
5591 0 Global Internal Table (32-bit pointer)
5592 + Per-Shader Internal Table (32-bit pointer)
5593 + 1-15 Application Controlled User Data
5594 (1-15 Contiguous 32-bit Values in Registers)
5595 + Spill Table (32-bit pointer)
5596 + Draw Index (First Stage Only)
5597 + Vertex Offset (First Stage Only)
5598 + Instance Offset (First Stage Only)
5599 ============= ================================
5601 The placement of the global internal table remains fixed in the first *user
5602 data SGPR register*. Otherwise all parameters are optional, and can be mapped
5603 to any desired *user data SGPR register*, with the following regstrictions:
5605 * Draw Index, Vertex Offset, and Instance Offset can only be used by the first
5606 activehardware stage in a graphics pipeline (i.e. where the API vertex
5609 * Application-controlled user data must be mapped into a contiguous range of
5610 user data registers.
5612 * The application-controlled user data range supports compaction remapping, so
5613 only *entries* that are actually consumed by the shader must be assigned to
5614 corresponding *registers*. Note that in order to support an efficient runtime
5615 implementation, the remapping must pack *registers* in the same order as
5616 *entries*, with unused *entries* removed.
5618 .. _pal_global_internal_table:
5620 Global Internal Table
5621 ~~~~~~~~~~~~~~~~~~~~~
5623 The global internal table is a table of *shader resource descriptors* (SRDs) that
5624 define how certain engine-wide, runtime-managed resources should be accessed
5625 from a shader. The majority of these resources have HW-defined formats, and it
5626 is up to the compiler to write/read data as required by the target hardware.
5628 The following table illustrates the required format:
5630 .. table:: PAL Global Internal Table
5631 :name: pal-git-table
5633 ============= ================================
5635 ============= ================================
5636 0-3 Graphics Scratch SRD
5637 4-7 Compute Scratch SRD
5638 8-11 ES/GS Ring Output SRD
5639 12-15 ES/GS Ring Input SRD
5640 16-19 GS/VS Ring Output #0
5641 20-23 GS/VS Ring Output #1
5642 24-27 GS/VS Ring Output #2
5643 28-31 GS/VS Ring Output #3
5644 32-35 GS/VS Ring Input SRD
5645 36-39 Tessellation Factor Buffer SRD
5646 40-43 Off-Chip LDS Buffer SRD
5647 44-47 Off-Chip Param Cache Buffer SRD
5648 48-51 Sample Position Buffer SRD
5649 52 vaRange::ShadowDescriptorTable High Bits
5650 ============= ================================
5652 The pointer to the global internal table passed to the shader as user data
5653 is a 32-bit pointer. The top 32 bits should be assumed to be the same as
5654 the top 32 bits of the pipeline, so the shader may use the program
5655 counter's top 32 bits.
5660 This section provides code conventions used when the target triple OS is
5661 empty (see :ref:`amdgpu-target-triples`).
5666 For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
5667 not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
5668 instructions are handled as follows:
5670 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
5671 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
5673 =============== =============== ===========================================
5674 Usage Code Sequence Description
5675 =============== =============== ===========================================
5676 llvm.trap s_endpgm Causes wavefront to be terminated.
5677 llvm.debugtrap *none* Compiler warning given that there is no
5678 trap handler installed.
5679 =============== =============== ===========================================
5689 When the language is OpenCL the following differences occur:
5691 1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
5692 2. The AMDGPU backend appends additional arguments to the kernel's explicit
5693 arguments for the AMDHSA OS (see
5694 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
5695 3. Additional metadata is generated
5696 (see :ref:`amdgpu-amdhsa-code-object-metadata`).
5698 .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
5699 :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
5701 ======== ==== ========= ===========================================
5702 Position Byte Byte Description
5704 ======== ==== ========= ===========================================
5705 1 8 8 OpenCL Global Offset X
5706 2 8 8 OpenCL Global Offset Y
5707 3 8 8 OpenCL Global Offset Z
5708 4 8 8 OpenCL address of printf buffer
5709 5 8 8 OpenCL address of virtual queue used by
5711 6 8 8 OpenCL address of AqlWrap struct used by
5713 7 8 8 Pointer argument used for Multi-gird
5715 ======== ==== ========= ===========================================
5722 When the language is HCC the following differences occur:
5724 1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
5726 .. _amdgpu-assembler:
5731 AMDGPU backend has LLVM-MC based assembler which is currently in development.
5732 It supports AMDGCN GFX6-GFX10.
5734 This section describes general syntax for instructions and operands.
5742 AMDGPU/AMDGPUAsmGFX7
5743 AMDGPU/AMDGPUAsmGFX8
5744 AMDGPU/AMDGPUAsmGFX9
5745 AMDGPU/AMDGPUAsmGFX10
5746 AMDGPUModifierSyntax
5748 AMDGPUInstructionSyntax
5749 AMDGPUInstructionNotation
5751 An instruction has the following :doc:`syntax<AMDGPUInstructionSyntax>`:
5753 ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,... <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...``
5755 :doc:`Operands<AMDGPUOperandSyntax>` are normally comma-separated while
5756 :doc:`modifiers<AMDGPUModifierSyntax>` are space-separated.
5758 The order of *operands* and *modifiers* is fixed.
5759 Most *modifiers* are optional and may be omitted.
5761 See detailed instruction syntax description for :doc:`GFX7<AMDGPU/AMDGPUAsmGFX7>`,
5762 :doc:`GFX8<AMDGPU/AMDGPUAsmGFX8>`, :doc:`GFX9<AMDGPU/AMDGPUAsmGFX9>`
5763 and :doc:`GFX10<AMDGPU/AMDGPUAsmGFX10>`.
5765 Note that features under development are not included in this description.
5767 For more information about instructions, their semantics and supported combinations of
5768 operands, refer to one of instruction set architecture manuals
5769 [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_, [AMD-GCN-GFX9]_ and
5775 Detailed description of operands may be found :doc:`here<AMDGPUOperandSyntax>`.
5780 Detailed description of modifiers may be found :doc:`here<AMDGPUModifierSyntax>`.
5782 Instruction Examples
5783 ~~~~~~~~~~~~~~~~~~~~
5788 .. code-block:: nasm
5790 ds_add_u32 v2, v4 offset:16
5791 ds_write_src2_b64 v2 offset0:4 offset1:8
5792 ds_cmpst_f32 v2, v4, v6
5793 ds_min_rtn_f64 v[8:9], v2, v[4:5]
5796 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
5801 .. code-block:: nasm
5803 flat_load_dword v1, v[3:4]
5804 flat_store_dwordx3 v[3:4], v[5:7]
5805 flat_atomic_swap v1, v[3:4], v5 glc
5806 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
5807 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
5809 For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
5814 .. code-block:: nasm
5816 buffer_load_dword v1, off, s[4:7], s1
5817 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
5818 buffer_store_format_xy v[1:2], off, s[4:7], s1
5820 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
5822 For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
5827 .. code-block:: nasm
5829 s_load_dword s1, s[2:3], 0xfc
5830 s_load_dwordx8 s[8:15], s[2:3], s4
5831 s_load_dwordx16 s[88:103], s[2:3], s4
5835 For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
5840 .. code-block:: nasm
5843 s_mov_b64 s[0:1], 0x80000000
5845 s_wqm_b64 s[2:3], s[4:5]
5846 s_bcnt0_i32_b64 s1, s[2:3]
5847 s_swappc_b64 s[2:3], s[4:5]
5848 s_cbranch_join s[4:5]
5850 For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
5855 .. code-block:: nasm
5857 s_add_u32 s1, s2, s3
5858 s_and_b64 s[2:3], s[4:5], s[6:7]
5859 s_cselect_b32 s1, s2, s3
5860 s_andn2_b32 s2, s4, s6
5861 s_lshr_b64 s[2:3], s[4:5], s6
5862 s_ashr_i32 s2, s4, s6
5863 s_bfm_b64 s[2:3], s4, s6
5864 s_bfe_i64 s[2:3], s[4:5], s6
5865 s_cbranch_g_fork s[4:5], s[6:7]
5867 For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
5872 .. code-block:: nasm
5875 s_bitcmp1_b32 s1, s2
5876 s_bitcmp0_b64 s[2:3], s4
5879 For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
5884 .. code-block:: nasm
5889 s_waitcnt 0 ; Wait for all counters to be 0
5890 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
5891 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
5895 s_sendmsg sendmsg(MSG_INTERRUPT)
5898 For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
5900 Unless otherwise mentioned, little verification is performed on the operands
5901 of SOPP Instructions, so it is up to the programmer to be familiar with the
5902 range or acceptable values.
5907 For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
5908 the assembler will automatically use optimal encoding based on its operands.
5909 To force specific encoding, one can add a suffix to the opcode of the instruction:
5911 * _e32 for 32-bit VOP1/VOP2/VOPC
5912 * _e64 for 64-bit VOP3
5914 * _sdwa for VOP_SDWA
5916 VOP1/VOP2/VOP3/VOPC examples:
5918 .. code-block:: nasm
5921 v_mov_b32_e32 v1, v2
5923 v_cvt_f64_i32_e32 v[1:2], v2
5924 v_floor_f32_e32 v1, v2
5925 v_bfrev_b32_e32 v1, v2
5926 v_add_f32_e32 v1, v2, v3
5927 v_mul_i32_i24_e64 v1, v2, 3
5928 v_mul_i32_i24_e32 v1, -3, v3
5929 v_mul_i32_i24_e32 v1, -100, v3
5930 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
5931 v_max_f16_e32 v1, v2, v3
5935 .. code-block:: nasm
5937 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
5938 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
5939 v_mov_b32 v0, v0 wave_shl:1
5940 v_mov_b32 v0, v0 row_mirror
5941 v_mov_b32 v0, v0 row_bcast:31
5942 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
5943 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
5944 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
5948 .. code-block:: nasm
5950 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
5951 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
5952 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
5953 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
5954 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
5956 For full list of supported instructions, refer to "Vector ALU instructions".
5959 Remove once we switch to code object v3 by default.
5961 .. _amdgpu-amdhsa-assembler-predefined-symbols-v2:
5963 Code Object V2 Predefined Symbols (-mattr=-code-object-v3)
5964 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
5966 .. warning:: Code Object V2 is not the default code object version emitted by
5967 this version of LLVM. For a description of the predefined symbols available
5968 with the default configuration (Code Object V3) see
5969 :ref:`amdgpu-amdhsa-assembler-predefined-symbols-v3`.
5971 The AMDGPU assembler defines and updates some symbols automatically. These
5972 symbols do not affect code generation.
5974 .option.machine_version_major
5975 +++++++++++++++++++++++++++++
5977 Set to the GFX major generation number of the target being assembled for. For
5978 example, when assembling for a "GFX9" target this will be set to the integer
5979 value "9". The possible GFX major generation numbers are presented in
5980 :ref:`amdgpu-processors`.
5982 .option.machine_version_minor
5983 +++++++++++++++++++++++++++++
5985 Set to the GFX minor generation number of the target being assembled for. For
5986 example, when assembling for a "GFX810" target this will be set to the integer
5987 value "1". The possible GFX minor generation numbers are presented in
5988 :ref:`amdgpu-processors`.
5990 .option.machine_version_stepping
5991 ++++++++++++++++++++++++++++++++
5993 Set to the GFX stepping generation number of the target being assembled for.
5994 For example, when assembling for a "GFX704" target this will be set to the
5995 integer value "4". The possible GFX stepping generation numbers are presented
5996 in :ref:`amdgpu-processors`.
6001 Set to zero each time a
6002 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
6003 encountered. At each instruction, if the current value of this symbol is less
6004 than or equal to the maximum VPGR number explicitly referenced within that
6005 instruction then the symbol value is updated to equal that VGPR number plus
6011 Set to zero each time a
6012 :ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is
6013 encountered. At each instruction, if the current value of this symbol is less
6014 than or equal to the maximum VPGR number explicitly referenced within that
6015 instruction then the symbol value is updated to equal that SGPR number plus
6018 .. _amdgpu-amdhsa-assembler-directives-v2:
6020 Code Object V2 Directives (-mattr=-code-object-v3)
6021 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6023 .. warning:: Code Object V2 is not the default code object version emitted by
6024 this version of LLVM. For a description of the directives supported with
6025 the default configuration (Code Object V3) see
6026 :ref:`amdgpu-amdhsa-assembler-directives-v3`.
6028 AMDGPU ABI defines auxiliary data in output code object. In assembly source,
6029 one can specify them with assembler directives.
6031 .hsa_code_object_version major, minor
6032 +++++++++++++++++++++++++++++++++++++
6034 *major* and *minor* are integers that specify the version of the HSA code
6035 object that will be generated by the assembler.
6037 .hsa_code_object_isa [major, minor, stepping, vendor, arch]
6038 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
6041 *major*, *minor*, and *stepping* are all integers that describe the instruction
6042 set architecture (ISA) version of the assembly program.
6044 *vendor* and *arch* are quoted strings. *vendor* should always be equal to
6045 "AMD" and *arch* should always be equal to "AMDGPU".
6047 By default, the assembler will derive the ISA version, *vendor*, and *arch*
6048 from the value of the -mcpu option that is passed to the assembler.
6050 .. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
6052 .amdgpu_hsa_kernel (name)
6053 +++++++++++++++++++++++++
6055 This directives specifies that the symbol with given name is a kernel entry point
6056 (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
6061 This directive marks the beginning of a list of key / value pairs that are used
6062 to specify the amd_kernel_code_t object that will be emitted by the assembler.
6063 The list must be terminated by the *.end_amd_kernel_code_t* directive. For
6064 any amd_kernel_code_t values that are unspecified a default value will be
6065 used. The default value for all keys is 0, with the following exceptions:
6067 - *amd_code_version_major* defaults to 1.
6068 - *amd_kernel_code_version_minor* defaults to 2.
6069 - *amd_machine_kind* defaults to 1.
6070 - *amd_machine_version_major*, *machine_version_minor*, and
6071 *amd_machine_version_stepping* are derived from the value of the -mcpu option
6072 that is passed to the assembler.
6073 - *kernel_code_entry_byte_offset* defaults to 256.
6074 - *wavefront_size* defaults 6 for all targets before GFX10. For GFX10 onwards
6075 defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5.
6076 Note that wavefront size is specified as a power of two, so a value of **n**
6077 means a size of 2^ **n**.
6078 - *call_convention* defaults to -1.
6079 - *kernarg_segment_alignment*, *group_segment_alignment*, and
6080 *private_segment_alignment* default to 4. Note that alignments are specified
6081 as a power of 2, so a value of **n** means an alignment of 2^ **n**.
6082 - *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for
6084 - *enable_mem_ordered* defaults to 1 for GFX10 onwards.
6086 The *.amd_kernel_code_t* directive must be placed immediately after the
6087 function label and before any instructions.
6089 For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
6090 comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
6092 .. _amdgpu-amdhsa-assembler-example-v2:
6094 Code Object V2 Example Source Code (-mattr=-code-object-v3)
6095 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6097 .. warning:: Code Object V2 is not the default code object version emitted by
6098 this version of LLVM. For a description of the directives supported with
6099 the default configuration (Code Object V3) see
6100 :ref:`amdgpu-amdhsa-assembler-example-v3`.
6102 Here is an example of a minimal assembly source file, defining one HSA kernel:
6104 .. code-block:: none
6106 .hsa_code_object_version 1,0
6107 .hsa_code_object_isa
6112 .amdgpu_hsa_kernel hello_world
6117 enable_sgpr_kernarg_segment_ptr = 1
6119 compute_pgm_rsrc1_vgprs = 0
6120 compute_pgm_rsrc1_sgprs = 0
6121 compute_pgm_rsrc2_user_sgpr = 2
6122 compute_pgm_rsrc1_wgp_mode = 0
6123 compute_pgm_rsrc1_mem_ordered = 0
6124 compute_pgm_rsrc1_fwd_progress = 1
6125 .end_amd_kernel_code_t
6127 s_load_dwordx2 s[0:1], s[0:1] 0x0
6128 v_mov_b32 v0, 3.14159
6129 s_waitcnt lgkmcnt(0)
6132 flat_store_dword v[1:2], v0
6135 .size hello_world, .Lfunc_end0-hello_world
6137 .. _amdgpu-amdhsa-assembler-predefined-symbols-v3:
6139 Code Object V3 Predefined Symbols (-mattr=+code-object-v3)
6140 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6142 The AMDGPU assembler defines and updates some symbols automatically. These
6143 symbols do not affect code generation.
6145 .amdgcn.gfx_generation_number
6146 +++++++++++++++++++++++++++++
6148 Set to the GFX major generation number of the target being assembled for. For
6149 example, when assembling for a "GFX9" target this will be set to the integer
6150 value "9". The possible GFX major generation numbers are presented in
6151 :ref:`amdgpu-processors`.
6153 .amdgcn.gfx_generation_minor
6154 ++++++++++++++++++++++++++++
6156 Set to the GFX minor generation number of the target being assembled for. For
6157 example, when assembling for a "GFX810" target this will be set to the integer
6158 value "1". The possible GFX minor generation numbers are presented in
6159 :ref:`amdgpu-processors`.
6161 .amdgcn.gfx_generation_stepping
6162 +++++++++++++++++++++++++++++++
6164 Set to the GFX stepping generation number of the target being assembled for.
6165 For example, when assembling for a "GFX704" target this will be set to the
6166 integer value "4". The possible GFX stepping generation numbers are presented
6167 in :ref:`amdgpu-processors`.
6169 .. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr:
6171 .amdgcn.next_free_vgpr
6172 ++++++++++++++++++++++
6174 Set to zero before assembly begins. At each instruction, if the current value
6175 of this symbol is less than or equal to the maximum VGPR number explicitly
6176 referenced within that instruction then the symbol value is updated to equal
6177 that VGPR number plus one.
6179 May be used to set the `.amdhsa_next_free_vpgr` directive in
6180 :ref:`amdhsa-kernel-directives-table`.
6182 May be set at any time, e.g. manually set to zero at the start of each kernel.
6184 .. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr:
6186 .amdgcn.next_free_sgpr
6187 ++++++++++++++++++++++
6189 Set to zero before assembly begins. At each instruction, if the current value
6190 of this symbol is less than or equal the maximum SGPR number explicitly
6191 referenced within that instruction then the symbol value is updated to equal
6192 that SGPR number plus one.
6194 May be used to set the `.amdhsa_next_free_spgr` directive in
6195 :ref:`amdhsa-kernel-directives-table`.
6197 May be set at any time, e.g. manually set to zero at the start of each kernel.
6199 .. _amdgpu-amdhsa-assembler-directives-v3:
6201 Code Object V3 Directives (-mattr=+code-object-v3)
6202 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6204 Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
6205 architecture processors, and are not OS-specific. Directives which begin with
6206 ``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
6207 ``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
6208 :ref:`amdgpu-processors`.
6210 .amdgcn_target <target>
6211 +++++++++++++++++++++++
6213 Optional directive which declares the target supported by the containing
6214 assembler source file. Valid values are described in
6215 :ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler
6216 to validate command-line options such as ``-triple``, ``-mcpu``, and those
6217 which specify target features.
6219 .amdhsa_kernel <name>
6220 +++++++++++++++++++++
6222 Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
6223 ``<name>.kd``, in the current location of the current section. Only valid when
6224 the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
6225 instruction to execute, and does not need to be previously defined.
6227 Marks the beginning of a list of directives used to generate the bytes of a
6228 kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
6229 Directives which may appear in this list are described in
6230 :ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
6231 be valid for the target being assembled for, and cannot be repeated. Directives
6232 support the range of values specified by the field they reference in
6233 :ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
6234 assumed to have its default value, unless it is marked as "Required", in which
6235 case it is an error to omit the directive. This list of directives is
6236 terminated by an ``.end_amdhsa_kernel`` directive.
6238 .. table:: AMDHSA Kernel Assembler Directives
6239 :name: amdhsa-kernel-directives-table
6241 ======================================================== =================== ============ ===================
6242 Directive Default Supported On Description
6243 ======================================================== =================== ============ ===================
6244 ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX10 Controls GROUP_SEGMENT_FIXED_SIZE in
6245 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6246 ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX10 Controls PRIVATE_SEGMENT_FIXED_SIZE in
6247 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6248 ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
6249 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6250 ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_PTR in
6251 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6252 ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_QUEUE_PTR in
6253 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6254 ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
6255 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6256 ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_ID in
6257 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6258 ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX10 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
6259 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6260 ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
6261 :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6262 ``.amdhsa_wavefront_size32`` Target GFX10 Controls ENABLE_WAVEFRONT_SIZE32 in
6263 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6266 ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in
6267 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6268 ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_X in
6269 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6270 ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Y in
6271 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6272 ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Z in
6273 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6274 ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_INFO in
6275 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6276 ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX10 Controls ENABLE_VGPR_WORKITEM_ID in
6277 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6278 Possible values are defined in
6279 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
6280 ``.amdhsa_next_free_vgpr`` Required GFX6-GFX10 Maximum VGPR number explicitly referenced, plus one.
6281 Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
6282 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6283 ``.amdhsa_next_free_sgpr`` Required GFX6-GFX10 Maximum SGPR number explicitly referenced, plus one.
6284 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
6285 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6286 ``.amdhsa_reserve_vcc`` 1 GFX6-GFX10 Whether the kernel may use the special VCC SGPR.
6287 Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
6288 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6289 ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX10 Whether the kernel may use flat instructions to access
6290 scratch memory. Used to calculate
6291 GRANULATED_WAVEFRONT_SGPR_COUNT in
6292 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6293 ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay.
6294 Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
6295 Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6297 ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_32 in
6298 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6299 Possible values are defined in
6300 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
6301 ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_16_64 in
6302 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6303 Possible values are defined in
6304 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
6305 ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX10 Controls FLOAT_DENORM_MODE_32 in
6306 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6307 Possible values are defined in
6308 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
6309 ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX10 Controls FLOAT_DENORM_MODE_16_64 in
6310 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6311 Possible values are defined in
6312 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
6313 ``.amdhsa_dx10_clamp`` 1 GFX6-GFX10 Controls ENABLE_DX10_CLAMP in
6314 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6315 ``.amdhsa_ieee_mode`` 1 GFX6-GFX10 Controls ENABLE_IEEE_MODE in
6316 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6317 ``.amdhsa_fp16_overflow`` 0 GFX9-GFX10 Controls FP16_OVFL in
6318 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6319 ``.amdhsa_workgroup_processor_mode`` Target GFX10 Controls ENABLE_WGP_MODE in
6320 Feature :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`.
6323 ``.amdhsa_memory_ordered`` 1 GFX10 Controls MEM_ORDERED in
6324 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6325 ``.amdhsa_forward_progress`` 0 GFX10 Controls FWD_PROGRESS in
6326 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`.
6327 ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
6328 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6329 ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
6330 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6331 ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
6332 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6333 ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
6334 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6335 ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
6336 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6337 ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
6338 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6339 ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
6340 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`.
6341 ======================================================== =================== ============ ===================
6346 Optional directive which declares the contents of the ``NT_AMDGPU_METADATA``
6347 note record (see :ref:`amdgpu-elf-note-records-table-v3`).
6349 The contents must be in the [YAML]_ markup format, with the same structure and
6350 semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`.
6352 This directive is terminated by an ``.end_amdgpu_metadata`` directive.
6354 .. _amdgpu-amdhsa-assembler-example-v3:
6356 Code Object V3 Example Source Code (-mattr=+code-object-v3)
6357 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6359 Here is an example of a minimal assembly source file, defining one HSA kernel:
6361 .. code-block:: none
6363 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
6368 .type hello_world,@function
6370 s_load_dwordx2 s[0:1], s[0:1] 0x0
6371 v_mov_b32 v0, 3.14159
6372 s_waitcnt lgkmcnt(0)
6375 flat_store_dword v[1:2], v0
6378 .size hello_world, .Lfunc_end0-hello_world
6382 .amdhsa_kernel hello_world
6383 .amdhsa_user_sgpr_kernarg_segment_ptr 1
6384 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
6385 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
6394 - .name: hello_world
6395 .symbol: hello_world.kd
6396 .kernarg_segment_size: 48
6397 .group_segment_fixed_size: 0
6398 .private_segment_fixed_size: 0
6399 .kernarg_segment_align: 4
6403 .max_flat_workgroup_size: 256
6405 .end_amdgpu_metadata
6407 If an assembly source file contains multiple kernels and/or functions, the
6408 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and
6409 :ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using
6410 the ``.set <symbol>, <expression>`` directive. For example, in the case of two
6411 kernels, where ``function1`` is only called from ``kernel1`` it is sufficient
6412 to group the function with the kernel that calls it and reset the symbols
6413 between the two connected components:
6415 .. code-block:: none
6417 .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional
6419 // gpr tracking symbols are implicitly set to zero
6424 .type kern0,@function
6429 .size kern0, .Lkern0_end-kern0
6433 .amdhsa_kernel kern0
6435 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
6436 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
6439 // reset symbols to begin tracking usage in func1 and kern1
6440 .set .amdgcn.next_free_vgpr, 0
6441 .set .amdgcn.next_free_sgpr, 0
6447 .type func1,@function
6450 s_setpc_b64 s[30:31]
6452 .size func1, .Lfunc1_end-func1
6456 .type kern1,@function
6460 s_add_u32 s4, s4, func1@rel32@lo+4
6461 s_addc_u32 s5, s5, func1@rel32@lo+4
6462 s_swappc_b64 s[30:31], s[4:5]
6466 .size kern1, .Lkern1_end-kern1
6470 .amdhsa_kernel kern1
6472 .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
6473 .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
6476 These symbols cannot identify connected components in order to automatically
6477 track the usage for each kernel. However, in some cases careful organization of
6478 the kernels and functions in the source file means there is minimal additional
6479 effort required to accurately calculate GPR usage.
6481 Additional Documentation
6482 ========================
6484 .. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
6485 .. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
6486 .. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
6487 .. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
6488 .. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
6489 .. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
6490 .. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
6491 .. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
6492 .. [AMD-GCN-GFX10] AMD "Navi" Instruction Set Architecture *TBA*
6494 ttye Add link when made public.
6495 .. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
6496 .. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
6497 .. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
6498 .. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
6499 .. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
6500 .. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
6501 .. [MsgPack] `Message Pack <http://www.msgpack.org/>`__
6502 .. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
6503 .. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
6504 .. [CLANG-ATTR] `Attributes in Clang <http://clang.llvm.org/docs/AttributeReference.html>`__