clang/docs/HIPSupport.rst

   1 .. raw:: html
   2
   3   <style type="text/css">
   4     .none { background-color: #FFCCCC }
   5     .part { background-color: #FFFF99 }
   6     .good { background-color: #CCFF99 }
   7   </style>
   8
   9 .. role:: none
  10 .. role:: part
  11 .. role:: good
  12
  13 .. contents::
  14    :local:
  15
  16 =============
  17 HIP Support
  18 =============
  19
  20 HIP (Heterogeneous-Compute Interface for Portability) `<https://github.com/ROCm-Developer-Tools/HIP>`_ is
  21 a C++ Runtime API and Kernel Language. It enables developers to create portable applications for
  22 offloading computation to different hardware platforms from a single source code.
  23
  24 AMD GPU Support
  25 ===============
  26
  27 Clang provides HIP support on AMD GPUs via the ROCm platform `<https://rocm.docs.amd.com/en/latest/#>`_.
  28 The ROCm runtime forms the base for HIP host APIs, while HIP device APIs are realized through HIP header
  29 files and the ROCm device library. The Clang driver uses the HIPAMD toolchain to compile HIP device code
  30 to AMDGPU ISA via the AMDGPU backend, or SPIR-V via the workflow outlined below.
  31 The compiled code is then bundled and embedded in the host executables.
  32
  33 Intel GPU Support
  34 =================
  35
  36 Clang provides partial HIP support on Intel GPUs using the CHIP-Star project `<https://github.com/CHIP-SPV/chipStar>`_.
  37 CHIP-Star implements the HIP runtime over oneAPI Level Zero or OpenCL runtime. The Clang driver uses the HIPSPV
  38 toolchain to compile HIP device code into LLVM IR, which is subsequently translated to SPIR-V via the SPIR-V
  39 backend or the out-of-tree LLVM-SPIRV translator. The SPIR-V is then bundled and embedded into the host executables.
  40
  41 .. note::
  42    While Clang does not directly provide HIP support for NVIDIA GPUs and CPUs, these platforms are supported via other means:
  43
  44    - NVIDIA GPUs: HIP support is offered through the HIP project `<https://github.com/ROCm-Developer-Tools/HIP>`_, which provides a header-only library for translating HIP runtime APIs into CUDA runtime APIs. The code is subsequently compiled using NVIDIA's `nvcc`.
  45
  46    - CPUs: HIP support is available through the HIP-CPU runtime library `<https://github.com/ROCm-Developer-Tools/HIP-CPU>`_. This header-only library enables CPUs to execute unmodified HIP code.
  47
  48
  49 Example Usage
  50 =============
  51
  52 To compile a HIP program, use the following command:
  53
  54 .. code-block:: shell
  55
  56    clang++ -c --offload-arch=gfx906 -xhip sample.cpp -o sample.o
  57
  58 The ``-xhip`` option indicates that the source is a HIP program. If the file has a ``.hip`` extension,
  59 Clang will automatically recognize it as a HIP program:
  60
  61 .. code-block:: shell
  62
  63    clang++ -c --offload-arch=gfx906 sample.hip -o sample.o
  64
  65 To link a HIP program, use this command:
  66
  67 .. code-block:: shell
  68
  69    clang++ --hip-link --offload-arch=gfx906 sample.o -o sample
  70
  71 In the above command, the ``--hip-link`` flag instructs Clang to link the HIP runtime library. However,
  72 the use of this flag is unnecessary if a HIP input file is already present in your program.
  73
  74 For convenience, Clang also supports compiling and linking in a single step:
  75
  76 .. code-block:: shell
  77
  78    clang++ --offload-arch=gfx906 -xhip sample.cpp -o sample
  79
  80 In the above commands, ``gfx906`` is the GPU architecture that the code is being compiled for. The supported GPU
  81 architectures can be found in the `AMDGPU Processor Table <https://llvm.org/docs/AMDGPUUsage.html#processors>`_.
  82 Alternatively, you can use the ``amdgpu-arch`` tool that comes with Clang to list the GPU architecture on your system:
  83
  84 .. code-block:: shell
  85
  86    amdgpu-arch
  87
  88 You can use ``--offload-arch=native`` to automatically detect the GPU architectures on your system:
  89
  90 .. code-block:: shell
  91
  92    clang++ --offload-arch=native -xhip sample.cpp -o sample
  93
  94
  95 Path Setting for Dependencies
  96 =============================
  97
  98 Compiling a HIP program depends on the HIP runtime and device library. The paths to the HIP runtime and device libraries
  99 can be specified either using compiler options or environment variables. The paths can also be set through the ROCm path
 100 if they follow the ROCm installation directory structure.
 101
 102 Order of Precedence for HIP Path
 103 --------------------------------
 104
 105 1. ``--hip-path`` compiler option
 106 2. ``HIP_PATH`` environment variable *(use with caution)*
 107 3. ``--rocm-path`` compiler option
 108 4. ``ROCM_PATH`` environment variable *(use with caution)*
 109 5. Default automatic detection (relative to Clang or at the default ROCm installation location)
 110
 111 Order of Precedence for Device Library Path
 112 -------------------------------------------
 113
 114 1. ``--hip-device-lib-path`` compiler option
 115 2. ``HIP_DEVICE_LIB_PATH`` environment variable *(use with caution)*
 116 3. ``--rocm-path`` compiler option
 117 4. ``ROCM_PATH`` environment variable *(use with caution)*
 118 5. Default automatic detection (relative to Clang or at the default ROCm installation location)
 119
 120 .. list-table::
 121    :header-rows: 1
 122
 123    * - Compiler Option
 124      - Environment Variable
 125      - Description
 126      - Default Value
 127    * - ``--rocm-path=<path>``
 128      - ``ROCM_PATH``
 129      - Specifies the ROCm installation path.
 130      - Automatic detection
 131    * - ``--hip-path=<path>``
 132      - ``HIP_PATH``
 133      - Specifies the HIP runtime installation path.
 134      - Determined by ROCm directory structure
 135    * - ``--hip-device-lib-path=<path>``
 136      - ``HIP_DEVICE_LIB_PATH``
 137      - Specifies the HIP device library installation path.
 138      - Determined by ROCm directory structure
 139
 140 .. note::
 141
 142    We recommend using the compiler options as the primary method for specifying these paths. While the environment variables ``ROCM_PATH``, ``HIP_PATH``, and ``HIP_DEVICE_LIB_PATH`` are supported, their use can lead to implicit dependencies that might cause issues in the long run. Use them with caution.
 143
 144
 145 Predefined Macros
 146 =================
 147
 148 .. list-table::
 149    :header-rows: 1
 150
 151    * - Macro
 152      - Description
 153    * - ``__CLANG_RDC__``
 154      - Defined when Clang is compiling code in Relocatable Device Code (RDC) mode. RDC, enabled with the ``-fgpu-rdc`` compiler option, is necessary for linking device codes across translation units.
 155    * - ``__HIP__``
 156      - Defined when compiling with HIP language support, indicating that the code targets the HIP environment.
 157    * - ``__HIPCC__``
 158      - Alias to ``__HIP__``.
 159    * - ``__HIP_DEVICE_COMPILE__``
 160      - Defined during device code compilation in Clang's separate compilation process for the host and each offloading GPU architecture.
 161    * - ``__HIP_MEMORY_SCOPE_SINGLETHREAD``
 162      - Represents single-thread memory scope in HIP (value is 1).
 163    * - ``__HIP_MEMORY_SCOPE_WAVEFRONT``
 164      - Represents wavefront memory scope in HIP (value is 2).
 165    * - ``__HIP_MEMORY_SCOPE_WORKGROUP``
 166      - Represents workgroup memory scope in HIP (value is 3).
 167    * - ``__HIP_MEMORY_SCOPE_AGENT``
 168      - Represents agent memory scope in HIP (value is 4).
 169    * - ``__HIP_MEMORY_SCOPE_SYSTEM``
 170      - Represents system-wide memory scope in HIP (value is 5).
 171    * - ``__HIP_NO_IMAGE_SUPPORT__``
 172      - Defined with a value of 1 when the target device lacks support for HIP image functions.
 173    * - ``__HIP_NO_IMAGE_SUPPORT``
 174      - Alias to ``__HIP_NO_IMAGE_SUPPORT__``. Deprecated.
 175    * - ``__HIP_API_PER_THREAD_DEFAULT_STREAM__``
 176      - Defined when the GPU default stream is set to per-thread mode.
 177    * - ``HIP_API_PER_THREAD_DEFAULT_STREAM``
 178      - Alias to ``__HIP_API_PER_THREAD_DEFAULT_STREAM__``. Deprecated.
 179
 180 Note that some architecture specific AMDGPU macros will have default values when
 181 used from the HIP host compilation. Other :doc:`AMDGPU macros <AMDGPUSupport>`
 182 like ``__AMDGCN_WAVEFRONT_SIZE__`` (deprecated) will default to 64 for example.
 183
 184 Compilation Modes
 185 =================
 186
 187 Each HIP source file contains intertwined device and host code. Depending on the chosen compilation mode by the compiler options ``-fno-gpu-rdc`` and ``-fgpu-rdc``, these portions of code are compiled differently.
 188
 189 Device Code Compilation
 190 -----------------------
 191
 192 **``-fno-gpu-rdc`` Mode (default)**:
 193
 194 - Compiles to a self-contained, fully linked offloading device binary for each offloading device architecture.
 195 - Device code within a Translation Unit (TU) cannot call functions located in another TU.
 196
 197 **``-fgpu-rdc`` Mode**:
 198
 199 - Compiles to a bitcode for each GPU architecture.
 200 - For each offloading device architecture, the bitcode from different TUs are linked together to create a single offloading device binary.
 201 - Device code in one TU can call functions located in another TU.
 202
 203 Host Code Compilation
 204 ---------------------
 205
 206 **Both Modes**:
 207
 208 - Compiles to a relocatable object for each TU.
 209 - These relocatable objects are then linked together.
 210 - Host code within a TU can call host functions and launch kernels from another TU.
 211
 212 Syntax Difference with CUDA
 213 ===========================
 214
 215 Clang's front end, used for both CUDA and HIP programming models, shares the same parsing and semantic analysis mechanisms. This includes the resolution of overloads concerning device and host functions. While there exists a comprehensive documentation on the syntax differences between Clang and NVCC for CUDA at `Dialect Differences Between Clang and NVCC <https://llvm.org/docs/CompileCudaWithLLVM.html#dialect-differences-between-clang-and-nvcc>`_, it is important to note that these differences also apply to HIP code compilation.
 216
 217 Predefined Macros for Differentiation
 218 -------------------------------------
 219
 220 To facilitate differentiation between HIP and CUDA code, as well as between device and host compilations within HIP, Clang defines specific macros:
 221
 222 - ``__HIP__`` : This macro is defined only when compiling HIP code. It can be used to conditionally compile code specific to HIP, enabling developers to write portable code that can be compiled for both CUDA and HIP.
 223
 224 - ``__HIP_DEVICE_COMPILE__`` : Defined exclusively during HIP device compilation, this macro allows for conditional compilation of device-specific code. It provides a mechanism to segregate device and host code, ensuring that each can be optimized for their respective execution environments.
 225
 226 Function Pointers Support
 227 =========================
 228
 229 Function pointers' support varies with the usage mode in Clang with HIP. The following table provides an overview of the support status across different use-cases and modes.
 230
 231 .. list-table:: Function Pointers Support Overview
 232    :widths: 25 25 25
 233    :header-rows: 1
 234
 235    * - Use Case
 236      - ``-fno-gpu-rdc`` Mode (default)
 237      - ``-fgpu-rdc`` Mode
 238    * - Defined and used in the same TU
 239      - Supported
 240      - Supported
 241    * - Defined in one TU and used in another TU
 242      - Not Supported
 243      - Supported
 244
 245 In the ``-fno-gpu-rdc`` mode, the compiler calculates the resource usage of kernels based only on functions present within the same TU. This mode does not support the use of function pointers defined in a different TU due to the possibility of incorrect resource usage calculations, leading to undefined behavior.
 246
 247 On the other hand, the ``-fgpu-rdc`` mode allows the definition and use of function pointers across different TUs, as resource usage calculations can accommodate functions from disparate TUs.
 248
 249 Virtual Function Support
 250 ========================
 251
 252 In Clang with HIP, support for calling virtual functions of an object in device or host code is contingent on where the object is constructed.
 253
 254 - **Constructed in Device Code**: Virtual functions of an object can be called in device code on a specific offloading device if the object is constructed in device code on an offloading device with the same architecture.
 255 - **Constructed in Host Code**: Virtual functions of an object can be called in host code if the object is constructed in host code.
 256
 257 In other scenarios, calling virtual functions is not allowed.
 258
 259 Explanation
 260 -----------
 261
 262 An object constructed on the device side contains a pointer to the virtual function table on the device side, which is not accessible in host code, and vice versa. Thus, trying to invoke virtual functions from a context different from where the object was constructed will be disallowed because the appropriate virtual table cannot be accessed. The virtual function tables for offloading devices with different architecures are different, therefore trying to invoke virtual functions from an offloading device with a different architecture than where the object is constructed is also disallowed.
 263
 264 Example Usage
 265 -------------
 266
 267 .. code-block:: c++
 268
 269    class Base {
 270    public:
 271       __device__ virtual void virtualFunction() {
 272          // Base virtual function implementation
 273       }
 274    };
 275
 276    class Derived : public Base {
 277    public:
 278       __device__ void virtualFunction() override {
 279          // Derived virtual function implementation
 280       }
 281    };
 282
 283    __global__ void kernel() {
 284       Derived obj;
 285       Base* basePtr = &obj;
 286       basePtr->virtualFunction(); // Allowed since obj is constructed in device code
 287    }
 288
 289 C++ Standard Parallelism Offload Support: Compiler And Runtime
 290 ==============================================================
 291
 292 Introduction
 293 ============
 294
 295 This section describes the implementation of support for offloading the
 296 execution of standard C++ algorithms to accelerators that can be targeted via
 297 HIP. Furthermore, it enumerates restrictions on user defined code, as well as
 298 the interactions with runtimes.
 299
 300 Algorithm Offload: What, Why, Where
 301 ===================================
 302
 303 C++17 introduced overloads
 304 `for most algorithms in the standard library <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0024r2.html>`_
 305 which allow the user to specify a desired
 306 `execution policy <https://en.cppreference.com/w/cpp/algorithm#Execution_policies>`_.
 307 The `parallel_unsequenced_policy <https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag_t>`_
 308 maps relatively well to the execution model of AMD GPUs. This, coupled with the
 309 the availability and maturity of GPU accelerated algorithm libraries that
 310 implement most / all corresponding algorithms in the standard library
 311 (e.g. `rocThrust <https://github.com/ROCmSoftwarePlatform/rocThrust>`__), makes
 312 it feasible to provide seamless accelerator offload for supported algorithms,
 313 when an accelerated version exists. Thus, it becomes possible to easily access
 314 the computational resources of an AMD accelerator, via a well specified,
 315 familiar, algorithmic interface, without having to delve into low-level hardware
 316 specific details. Putting it all together:
 317
 318 - **What**: standard library algorithms, when invoked with the
 319   ``parallel_unsequenced_policy``
 320 - **Why**: democratise AMDGPU accelerator programming, without loss of user
 321   familiarity
 322 - **Where**: only AMDGPU accelerators targeted by Clang/LLVM via HIP
 323
 324 Small Example
 325 =============
 326
 327 Given the following C++ code:
 328
 329 .. code-block:: C++
 330
 331    bool has_the_answer(const std::vector<int>& v) {
 332      return std::find(std::execution::par_unseq, std::cbegin(v), std::cend(v), 42) != std::cend(v);
 333    }
 334
 335 if Clang is invoked with the ``--hipstdpar --offload-arch=foo`` flags, the call
 336 to ``find`` will be offloaded to an accelerator that is part of the ``foo``
 337 target family. If either ``foo`` or its runtime environment do not support
 338 transparent on-demand paging (such as e.g. that provided in Linux via
 339 `HMM <https://docs.kernel.org/mm/hmm.html>`_), it is necessary to also include
 340 the ``--hipstdpar-interpose-alloc`` flag. If the accelerator specific algorithm
 341 library ``foo`` uses doesn't have an implementation of a particular algorithm,
 342 execution seamlessly falls back to the host CPU. It is legal to specify multiple
 343 ``--offload-arch``\s. All the flags we introduce, as well as a thorough view of
 344 various restrictions an their implementations, will be provided below.
 345
 346 Implementation - General View
 347 =============================
 348
 349 We built support for Algorithm Offload support atop the pre-existing HIP
 350 infrastructure. More specifically, when one requests offload via ``--hipstdpar``,
 351 compilation is switched to HIP compilation, as if ``-x hip`` was specified.
 352 Similarly, linking is also switched to HIP linking, as if ``--hip-link`` was
 353 specified. Note that these are implicit, and one should not assume that any
 354 interop with HIP specific language constructs is available e.g. ``__device__``
 355 annotations are neither necessary nor guaranteed to work.
 356
 357 Since there are no language restriction mechanisms in place, it is necessary to
 358 relax HIP language specific semantic checks performed by the FE; they would
 359 identify otherwise valid, offloadable code, as invalid HIP code. Given that we
 360 know that the user intended only for certain algorithms to be offloaded, and
 361 encoded this by specifying the ``parallel_unsequenced_policy``, we rely on a
 362 pass over IR to clean up any and all code that was not "meant" for offload. If
 363 requested, allocation interposition is also handled via a separate pass over IR.
 364
 365 To interface with the client HIP runtime, and to forward offloaded algorithm
 366 invocations to the corresponding accelerator specific library implementation, an
 367 implementation detail forwarding header is implicitly included by the driver,
 368 when compiling with ``--hipstdpar``. In what follows, we will delve into each
 369 component that contributes to implementing Algorithm Offload support.
 370
 371 Implementation - Driver
 372 =======================
 373
 374 We augment the ``clang`` driver with the following flags:
 375
 376 - ``--hipstdpar`` enables algorithm offload, which depending on phase, has the
 377   following effects:
 378
 379   - when compiling:
 380
 381     - ``-x hip`` gets prepended to enable HIP support;
 382     - the ``ROCmToolchain`` component checks for the ``hipstdpar_lib.hpp``
 383       forwarding header,
 384       `rocThrust <https://rocm.docs.amd.com/projects/rocThrust/en/latest/>`_ and
 385       `rocPrim <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_ in
 386       their canonical locations, which can be overriden via flags found below;
 387       if all are found, the forwarding header gets implicitly included,
 388       otherwise an error listing the missing component is generated;
 389     - the ``LangOpts.HIPStdPar`` member is set.
 390
 391   - when linking:
 392
 393     - ``--hip-link`` and ``-frtlib-add-rpath`` gets appended to enable HIP
 394       support.
 395
 396 - ``--hipstdpar-interpose-alloc`` enables the interposition of standard
 397   allocation / deallocation functions with accelerator aware equivalents; the
 398   ``LangOpts.HIPStdParInterposeAlloc`` member is set;
 399 - ``--hipstdpar-path=`` specifies a non-canonical path for the forwarding
 400   header; it must point to the folder where the header is located and not to the
 401   header itself;
 402 - ``--hipstdpar-thrust-path=`` specifies a non-canonical path for
 403   `rocThrust <https://rocm.docs.amd.com/projects/rocThrust/en/latest/>`_; it
 404   must point to the folder where the library is installed / built under a
 405   ``/thrust`` subfolder;
 406 - ``--hipstdpar-prim-path=`` specifies a non-canonical path for
 407   `rocPrim <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_; it must
 408   point to the folder where the library is installed / built under a
 409   ``/rocprim`` subfolder;
 410
 411 The `--offload-arch <https://llvm.org/docs/AMDGPUUsage.html#amdgpu-processors>`_
 412 flag can be used to specify the accelerator for which offload code is to be
 413 generated.
 414
 415 Implementation - Front-End
 416 ==========================
 417
 418 When ``LangOpts.HIPStdPar`` is set, we relax some of the HIP language specific
 419 ``Sema`` checks to account for the fact that we want to consume pure unannotated
 420 C++ code:
 421
 422 1. ``__device__`` / ``__host__ __device__`` functions (which would originate in
 423    the accelerator specific algorithm library) are allowed to call implicitly
 424    ``__host__`` functions;
 425 2. ``__global__`` functions (which would originate in the accelerator specific
 426    algorithm library) are allowed to call implicitly ``__host__`` functions;
 427 3. resolving ``__builtin`` availability is deferred, because it is possible that
 428    a ``__builtin`` that is unavailable on the target accelerator is not
 429    reachable from any offloaded algorithm, and thus will be safely removed in
 430    the middle-end;
 431 4. ASM parsing / checking is deferred, because it is possible that an ASM block
 432    that e.g. uses some constraints that are incompatible with the target
 433    accelerator is not reachable from any offloaded algorithm, and thus will be
 434    safely removed in the middle-end.
 435
 436 ``CodeGen`` is similarly relaxed, with implicitly ``__host__`` functions being
 437 emitted as well.
 438
 439 Implementation - Middle-End
 440 ===========================
 441
 442 We add two ``opt`` passes:
 443
 444 1. ``HipStdParAcceleratorCodeSelectionPass``
 445
 446    - For all kernels in a ``Module``, compute reachability, where a function
 447      ``F`` is reachable from a kernel ``K`` if and only if there exists a direct
 448      call-chain rooted in ``F`` that includes ``K``;
 449    - Remove all functions that are not reachable from kernels;
 450    - This pass is only run when compiling for the accelerator.
 451
 452 The first pass assumes that the only code that the user intended to offload was
 453 that which was directly or transitively invocable as part of an algorithm
 454 execution. It also assumes that an accelerator aware algorithm implementation
 455 would rely on accelerator specific special functions (kernels), and that these
 456 effectively constitute the only roots for accelerator execution graphs. Both of
 457 these assumptions are based on observing how widespread accelerators,
 458 such as GPUs, work.
 459
 460 1. ``HipStdParAllocationInterpositionPass``
 461
 462    - Iterate through all functions in a ``Module``, and replace standard
 463      allocation / deallocation functions with accelerator-aware equivalents,
 464      based on a pre-established table; the list of functions that can be
 465      interposed is available
 466      `here <https://github.com/ROCmSoftwarePlatform/roc-stdpar#allocation--deallocation-interposition-status>`__;
 467    - This is only run when compiling for the host.
 468
 469 The second pass is optional.
 470
 471 Implementation - Forwarding Header
 472 ==================================
 473
 474 The forwarding header implements two pieces of functionality:
 475
 476 1. It forwards algorithms to a target accelerator, which is done by relying on
 477    C++ language rules around overloading:
 478
 479    - overloads taking an explicit argument of type
 480      ``parallel_unsequenced_policy`` are introduced into the ``std`` namespace;
 481    - these will get preferentially selected versus the master template;
 482    - the body forwards to the equivalent algorithm from the accelerator specific
 483      library
 484
 485 2. It provides allocation / deallocation functions that are equivalent to the
 486    standard ones, but obtain memory by invoking
 487    `hipMallocManaged <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___memory_m.html#gab8cfa0e292193fa37e0cc2e4911fa90a>`_
 488    and release it via `hipFree <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___memory.html#ga740d08da65cae1441ba32f8fedb863d1>`_.
 489
 490 Predefined Macros
 491 =================
 492
 493 .. list-table::
 494    :header-rows: 1
 495
 496    * - Macro
 497      - Description
 498    * - ``__HIPSTDPAR__``
 499      - Defined when Clang is compiling code in algorithm offload mode, enabled
 500        with the ``--hipstdpar`` compiler option.
 501    * - ``__HIPSTDPAR_INTERPOSE_ALLOC__``
 502      - Defined only when compiling in algorithm offload mode, when the user
 503        enables interposition mode with the ``--hipstdpar-interpose-alloc``
 504        compiler option, indicating that all dynamic memory allocation /
 505        deallocation functions should be replaced with accelerator aware
 506        variants.
 507
 508 Restrictions
 509 ============
 510
 511 We define two modes in which runtime execution can occur:
 512
 513 1. **HMM Mode** - this assumes that the
 514    `HMM <https://docs.kernel.org/mm/hmm.html>`_ subsystem of the Linux kernel
 515    is used to provide transparent on-demand paging i.e. memory obtained from a
 516    system / OS allocator such as via a call to ``malloc`` or ``operator new`` is
 517    directly accessible to the accelerator and it follows the C++ memory model;
 518 2. **Interposition Mode** - this is a fallback mode for cases where transparent
 519    on-demand paging is unavailable (e.g. in the Windows OS), which means that
 520    memory must be allocated via an accelerator aware mechanism, and system
 521    allocated memory is inaccessible for the accelerator.
 522
 523 The following restrictions imposed on user code apply to both modes:
 524
 525 1. Pointers to function, and all associated features, such as e.g. dynamic
 526    polymorphism, cannot be used (directly or transitively) by the user provided
 527    callable passed to an algorithm invocation;
 528 2. Global / namespace scope / ``static`` / ``thread`` storage duration variables
 529    cannot be used (directly or transitively) in name by the user provided
 530    callable;
 531
 532    - When executing in **HMM Mode** they can be used in address e.g.:
 533
 534      .. code-block:: C++
 535
 536         namespace { int foo = 42; }
 537
 538         bool never(const std::vector<int>& v) {
 539           return std::any_of(std::execution::par_unseq, std::cbegin(v), std::cend(v), [](auto&& x) {
 540             return x == foo;
 541           });
 542         }
 543
 544         bool only_in_hmm_mode(const std::vector<int>& v) {
 545           return std::any_of(std::execution::par_unseq, std::cbegin(v), std::cend(v),
 546                              [p = &foo](auto&& x) { return x == *p; });
 547         }
 548
 549 3. Only algorithms that are invoked with the ``parallel_unsequenced_policy`` are
 550    candidates for offload;
 551 4. Only algorithms that are invoked with iterator arguments that model
 552    `random_access_iterator <https://en.cppreference.com/w/cpp/iterator/random_access_iterator>`_
 553    are candidates for offload;
 554 5. `Exceptions <https://en.cppreference.com/w/cpp/language/exceptions>`_ cannot
 555    be used by the user provided callable;
 556 6. Dynamic memory allocation (e.g. ``operator new``) cannot be used by the user
 557    provided callable;
 558 7. Selective offload is not possible i.e. it is not possible to indicate that
 559    only some algorithms invoked with the ``parallel_unsequenced_policy`` are to
 560    be executed on the accelerator.
 561
 562 In addition to the above, using **Interposition Mode** imposes the following
 563 additional restrictions:
 564
 565 1. All code that is expected to interoperate has to be recompiled with the
 566    ``--hipstdpar-interpose-alloc`` flag i.e. it is not safe to compose libraries
 567    that have been independently compiled;
 568 2. automatic storage duration (i.e. stack allocated) variables cannot be used
 569    (directly or transitively) by the user provided callable e.g.
 570
 571    .. code-block:: c++
 572
 573       bool never(const std::vector<int>& v, int n) {
 574         return std::any_of(std::execution::par_unseq, std::cbegin(v), std::cend(v),
 575                            [p = &n](auto&& x) { return x == *p; });
 576       }
 577
 578 Current Support
 579 ===============
 580
 581 At the moment, C++ Standard Parallelism Offload is only available for AMD GPUs,
 582 when the `ROCm <https://rocm.docs.amd.com/en/latest/>`_ stack is used, on the
 583 Linux operating system. Support is synthesised in the following table:
 584
 585 .. list-table::
 586    :header-rows: 1
 587
 588    * - `Processor <https://llvm.org/docs/AMDGPUUsage.html#amdgpu-processors>`_
 589      - HMM Mode
 590      - Interposition Mode
 591    * - GCN GFX9 (Vega)
 592      - YES
 593      - YES
 594    * - GCN GFX10.1 (RDNA 1)
 595      - *NO*
 596      - YES
 597    * - GCN GFX10.3 (RDNA 2)
 598      - *NO*
 599      - YES
 600    * - GCN GFX11 (RDNA 3)
 601      - *NO*
 602      - YES
 603    * - GCN GFX12 (RDNA 4)
 604      - *NO*
 605      - YES
 606
 607 The minimum Linux kernel version for running in HMM mode is 6.4.
 608
 609 The forwarding header can be obtained from
 610 `its GitHub repository <https://github.com/ROCmSoftwarePlatform/roc-stdpar>`_.
 611 It will be packaged with a future `ROCm <https://rocm.docs.amd.com/en/latest/>`_
 612 release. Because accelerated algorithms are provided via
 613 `rocThrust <https://rocm.docs.amd.com/projects/rocThrust/en/latest/>`_, a
 614 transitive dependency on
 615 `rocPrim <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_ exists. Both
 616 can be obtained either by installing their associated components of the
 617 `ROCm <https://rocm.docs.amd.com/en/latest/>`_ stack, or from their respective
 618 repositories. The list algorithms that can be offloaded is available
 619 `here <https://github.com/ROCmSoftwarePlatform/roc-stdpar#algorithm-support-status>`_.
 620
 621 HIP Specific Elements
 622 ---------------------
 623
 624 1. There is no defined interop with the
 625    `HIP kernel language <https://rocm.docs.amd.com/projects/HIP/en/latest/reference/kernel_language.html>`_;
 626    whilst things like using `__device__` annotations might accidentally "work",
 627    they are not guaranteed to, and thus cannot be relied upon by user code;
 628
 629    - A consequence of the above is that both bitcode linking and linking
 630      relocatable object files will "work", but it is not guaranteed to remain
 631      working or actively tested at the moment; this restriction might be relaxed
 632      in the future.
 633
 634 2. Combining explicit HIP, CUDA or OpenMP Offload compilation with
 635    ``--hipstdpar`` based offloading is not allowed or supported in any way.
 636 3. There is no way to target different accelerators via a standard algorithm
 637    invocation (`this might be addressed in future C++ standards <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2500r1.html>`_);
 638    an unsafe (per the point above) way of achieving this is to spawn new threads
 639    and invoke the `hipSetDevice <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___device.html#ga43c1e7f15925eeb762195ccb5e063eae>`_
 640    interface e.g.:
 641
 642    .. code-block:: c++
 643
 644       int accelerator_0 = ...;
 645       int accelerator_1 = ...;
 646
 647       bool multiple_accelerators(const std::vector<int>& u, const std::vector<int>& v) {
 648         std::atomic<unsigned int> r{0u};
 649
 650         thread t0{[&]() {
 651           hipSetDevice(accelerator_0);
 652
 653           r += std::count(std::execution::par_unseq, std::cbegin(u), std::cend(u), 42);
 654         }};
 655         thread t1{[&]() {
 656           hitSetDevice(accelerator_1);
 657
 658           r += std::count(std::execution::par_unseq, std::cbegin(v), std::cend(v), 314152)
 659         }};
 660
 661         t0.join();
 662         t1.join();
 663
 664         return r;
 665       }
 666
 667    Note that this is a temporary, unsafe workaround for a deficiency in the C++
 668    Standard.
 669
 670 Open Questions / Future Developments
 671 ====================================
 672
 673 1. The restriction on the use of global / namespace scope / ``static`` /
 674    ``thread`` storage duration variables in offloaded algorithms will be lifted
 675    in the future, when running in **HMM Mode**;
 676 2. The restriction on the use of dynamic memory allocation in offloaded
 677    algorithms will be lifted in the future.
 678 3. The restriction on the use of pointers to function, and associated features
 679    such as dynamic polymorphism might be lifted in the future, when running in
 680    **HMM Mode**;
 681 4. Offload support might be extended to cases where the ``parallel_policy`` is
 682    used for some or all targets.
 683
 684 SPIR-V Support on HIPAMD ToolChain
 685 ==================================
 686
 687 The HIPAMD ToolChain supports targetting
 688 `AMDGCN Flavoured SPIR-V <https://llvm.org/docs/SPIRVUsage.html#target-triples>`_.
 689 The support for SPIR-V in the ROCm and HIPAMD ToolChain is under active
 690 development.
 691
 692 Compilation Process
 693 -------------------
 694
 695 When compiling HIP programs with the intent of utilizing SPIR-V, the process
 696 diverges from the traditional compilation flow:
 697
 698 Using ``--offload-arch=amdgcnspirv``
 699 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 700
 701 - **Target Triple**: The ``--offload-arch=amdgcnspirv`` flag instructs the
 702   compiler to use the target triple ``spirv64-amd-amdhsa``. This approach does
 703   generates generic AMDGCN SPIR-V which retains architecture specific elements
 704   without hardcoding them, thus allowing for optimal target specific code to be
 705   generated at run time, when the concrete target is known.
 706
 707 - **LLVM IR Translation**: The program is compiled to LLVM Intermediate
 708   Representation (IR), which is subsequently translated into SPIR-V. In the
 709   future, this translation step will be replaced by direct SPIR-V emission via
 710   the SPIR-V Back-end.
 711
 712 - **Clang Offload Bundler**: The resulting SPIR-V is embedded in the Clang
 713   offload bundler with the bundle ID ``hip-spirv64-amd-amdhsa--amdgcnspirv``.
 714
 715 Architecture Specific Macros
 716 ----------------------------
 717
 718 None of the architecture specific :doc:`AMDGPU macros <AMDGPUSupport>` are
 719 defined when targeting SPIR-V. An alternative, more flexible mechanism to enable
 720 doing per target / per feature code selection will be added in the future.