libc/docs/gpu/using.rst

   1 .. _libc_gpu_usage:
   2
   3 ===================
   4 Using libc for GPUs
   5 ===================
   6
   7 .. contents:: Table of Contents
   8   :depth: 4
   9   :local:
  10
  11 Using the GPU C library
  12 =======================
  13
  14 Once you have finished :ref:`building<libc_gpu_building>` the GPU C library it
  15 can be used to run libc or libm functions directly on the GPU. Currently, not
  16 all C standard functions are supported on the GPU. Consult the :ref:`list of
  17 supported functions<libc_gpu_support>` for a comprehensive list.
  18
  19 The GPU C library supports two main usage modes. The first is as a supplementary
  20 library for offloading languages such as OpenMP, CUDA, or HIP. These aim to
  21 provide standard system utilities similarly to existing vendor libraries. The
  22 second method treats the GPU as a hosted target by compiling C or C++ for it
  23 directly. This is more similar to targeting OpenCL and is primarily used for
  24 exported functions on the GPU and testing.
  25
  26 Offloading usage
  27 ----------------
  28
  29 Offloading languages like CUDA, HIP, or OpenMP work by compiling a single source
  30 file for both the host target and a list of offloading devices. In order to
  31 support standard compilation flows, the ``clang`` driver uses fat binaries,
  32 described in the `clang documentation
  33 <https://clang.llvm.org/docs/OffloadingDesign.html>`_. This linking mode is used
  34 by the OpenMP toolchain, but is currently opt-in for the CUDA and HIP toolchains
  35 through the ``--offload-new-driver`` and ``-fgpu-rdc`` flags.
  36
  37 In order to link the GPU runtime, we simply pass this library to the embedded
  38 device linker job. This can be done using the ``-Xoffload-linker`` option, which
  39 forwards an argument to a ``clang`` job used to create the final GPU executable.
  40 The toolchain should pick up the C libraries automatically in most cases, so
  41 this shouldn't be necessary.
  42
  43 .. code-block:: sh
  44
  45   $> clang openmp.c -fopenmp --offload-arch=gfx90a -Xoffload-linker -lc
  46   $> clang cuda.cu --offload-arch=sm_80 --offload-new-driver -fgpu-rdc -Xoffload-linker -lc
  47   $> clang hip.hip --offload-arch=gfx940 --offload-new-driver -fgpu-rdc -Xoffload-linker -lc
  48
  49 This will automatically link in the needed function definitions if they were
  50 required by the user's application. Normally using the ``-fgpu-rdc`` option
  51 results in sub-par performance due to ABI linking. However, the offloading
  52 toolchain supports the ``-foffload-lto`` option to support LTO on the target
  53 device.
  54
  55 Offloading languages require that functions present on the device be declared as
  56 such. This is done with the ``__device__`` keyword in CUDA and HIP or the
  57 ``declare target`` pragma in OpenMP. This requires that the LLVM C library
  58 exposes its implemented functions to the compiler when it is used to build. We
  59 support this by providing wrapper headers in the compiler's resource directory.
  60 These are located in ``<clang-resource-dir>/include/llvm-libc-wrappers`` in your
  61 installation.
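
As a minimal sketch of what such a declaration looks like in user code, the
following marks a helper for device compilation with ``declare target`` and
calls it from a target region. The names ``scale_and_offset`` and
``run_on_device`` are hypothetical and chosen only for illustration. When
compiled without ``-fopenmp`` the pragmas are ignored and the code simply runs
on the host.

.. code-block:: c

  /* Hypothetical helper; compiled for both the host and the device. */
  #pragma omp declare target
  int scale_and_offset(int x) { return 2 * x + 1; }
  #pragma omp end declare target

  /* Calls the helper inside a target region and copies the result back. */
  int run_on_device(int x) {
    int result = 0;
  #pragma omp target map(tofrom : result)
    { result = scale_and_offset(x); }
    return result;
  }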
  62
  63 The support for HIP and CUDA is more experimental, requiring manual intervention
  64 to link and use the facilities. An example of this is shown in the :ref:`CUDA
  65 server example<libc_gpu_cuda_server>`. The OpenMP Offloading toolchain, however,
  66 is completely integrated with the LLVM C library. It will automatically handle
  67 including the necessary libraries, defining device-side interfaces, and running
  68 the RPC server.
  69
  70 OpenMP Offloading example
  71 ^^^^^^^^^^^^^^^^^^^^^^^^^
  72
  73 This section provides a simple example of compiling an OpenMP program with the
  74 GPU C library.
  75
  76 .. code-block:: c++
  77
  78   #include <stdio.h>
  79
  80   int main() {
  81     FILE *file = stderr;
  82   #pragma omp target teams num_teams(2) thread_limit(2)
  83   #pragma omp parallel num_threads(2)
  84     { fputs("Hello from OpenMP!\n", file); }
  85   }
  86
  87 This can simply be compiled like any other OpenMP application to print from two
  88 threads and two blocks.
  89
  90 .. code-block:: sh
  91
  92   $> clang openmp.c -fopenmp --offload-arch=gfx90a
  93   $> ./a.out
  94   Hello from OpenMP!
  95   Hello from OpenMP!
  96   Hello from OpenMP!
  97   Hello from OpenMP!
  98
  99 Including the wrapper headers, linking the C library, and running the :ref:`RPC
 100 server<libc_gpu_rpc>` are all handled automatically by the compiler and runtime.
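
Other library functions can be called from a target region in the same way. The
following is a sketch only, assuming ``strcmp`` is among the supported functions
for your target (the archive listing in the Binary format section shows a
``strcmp.cpp.o`` member). The name ``compare_on_device`` is hypothetical.

.. code-block:: c

  #include <string.h>

  /* Hypothetical helper: compares two short strings inside a target region. */
  int compare_on_device(void) {
    int result = 0;
  #pragma omp target map(from : result)
    {
      const char lhs[] = "apple";
      const char rhs[] = "banana";
      /* Negative because "apple" sorts before "banana". */
      result = strcmp(lhs, rhs);
    }
    return result;
  }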
 101
 102 Binary format
 103 ^^^^^^^^^^^^^
 104
 105 The ``libcgpu.a`` static archive is a fat-binary containing LLVM-IR for each
 106 supported target device. The supported architectures can be seen using LLVM's
 107 ``llvm-objdump`` with the ``--offloading`` flag:
 108
 109 .. code-block:: sh
 110
 111   $> llvm-objdump --offloading libcgpu-amdgpu.a
 112   libcgpu-amdgpu.a(strcmp.cpp.o):    file format elf64-x86-64
 113
 114   OFFLOADING IMAGE [0]:
 115   kind            llvm ir
 116   arch            generic
 117   triple          amdgcn-amd-amdhsa
 118   producer        none
 119   ...
 120
 121 Because the device code is stored inside a fat binary, it can be difficult to
 122 inspect the resulting code. This can be done using the following utilities:
 123
 124 .. code-block:: sh
 125
 126   $> llvm-ar x libcgpu.a strcmp.cpp.o
 127   $> clang-offload-packager strcmp.cpp.o --image=arch=generic,file=strcmp.bc
 128   $> opt -S strcmp.bc
 129   ...
 130
 131 Please note that this fat binary format is provided for compatibility with
 132 existing offloading toolchains. The implementation in ``libc`` does not depend
 133 on any existing offloading languages and is completely freestanding.
 134
 135 Direct compilation
 136 ------------------
 137
 138 Instead of using standard offloading languages, we can also target the GPU
 139 directly using C and C++ to create a GPU executable similarly to OpenCL. This is
 140 done by targeting the GPU architecture using `clang's cross compilation
 141 support <https://clang.llvm.org/docs/CrossCompilation.html>`_. This is the
 142 method that the GPU C library uses both to build the library and to run tests.
 143
 144 This allows us to easily define GPU specific libraries and programs that fit
 145 well into existing tools. In order to target the GPU effectively we rely heavily
 146 on the compiler's intrinsic and built-in functions. For example, the following
 147 function gets the thread identifier in the 'x' dimension on both supported
 148 GPU targets.
 149
 150 .. code-block:: c++
 151
 152   uint32_t get_thread_id_x() {
 153   #if defined(__AMDGPU__)
 154     return __builtin_amdgcn_workitem_id_x();
 155   #elif defined(__NVPTX__)
 156     return __nvvm_read_ptx_sreg_tid_x();
 157   #else
 158   #error "Unsupported platform"
 159   #endif
 160   }
 161
 162 We can then compile this for both NVPTX and AMDGPU into LLVM-IR using the
 163 following commands. This will yield valid LLVM-IR for the given target, just as
 164 if we were using CUDA, OpenCL, or OpenMP.
 165
 166 .. code-block:: sh
 167
 168   $> clang id.c --target=amdgcn-amd-amdhsa -mcpu=native -nogpulib -flto -c
 169   $> clang id.c --target=nvptx64-nvidia-cuda -march=native -nogpulib -flto -c
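
The same pattern extends to other hardware queries. As a hedged sketch, the
function below combines the block index, block size, and thread index builtins
into a flat index in the 'x' dimension. The name ``get_global_thread_id_x`` is
hypothetical, and the host fallback is an assumption added so the snippet also
compiles outside a GPU target; the original example rejects other platforms
with ``#error`` instead.

.. code-block:: c

  #include <stdint.h>

  /* Returns a flat thread index in the 'x' dimension across all blocks. */
  uint32_t get_global_thread_id_x(void) {
  #if defined(__AMDGPU__)
    return __builtin_amdgcn_workgroup_id_x() *
               __builtin_amdgcn_workgroup_size_x() +
           __builtin_amdgcn_workitem_id_x();
  #elif defined(__NVPTX__)
    return __nvvm_read_ptx_sreg_ctaid_x() * __nvvm_read_ptx_sreg_ntid_x() +
           __nvvm_read_ptx_sreg_tid_x();
  #else
    /* Assumption for illustration: a host build acts as a single thread. */
    return 0;
  #endif
  }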
 170
 171 We can also use this support to treat the GPU as a hosted environment by
 172 providing a C library and startup object just like a standard C library running
 173 on the host machine. Then, in order to execute these programs, we provide a
 174 loader utility to launch the executable on the GPU similar to a cross-compiling
 175 emulator. This is how we run :ref:`unit tests <libc_gpu_testing>` targeting the
 176 GPU. This is clearly not the most efficient way to use a GPU, but it provides a
 177 simple method to test execution on a GPU for debugging or development.
 178
 179 Building for AMDGPU targets
 180 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 181
 182 The AMDGPU target supports several features natively by virtue of using ``lld``
 183 as its linker. The installation will include the ``include/amdgcn-amd-amdhsa``
 184 and ``lib/amdgcn-amd-amdhsa`` directories that contain the necessary code to use
 185 the library. We can directly link against ``libc.a`` and use LTO to generate the
 186 final executable.
 187
 188 .. code-block:: c++
 189
 190   #include <stdio.h>
 191
 192   int main() { printf("Hello from AMDGPU!\n"); }
 193
 194 This program can then be compiled using the ``clang`` compiler. Note that
 195 ``-flto`` and ``-mcpu=`` should be defined. This is because the GPU
 196 sub-architectures do not have strict backwards compatibility. Use ``-mcpu=help``
 197 for accepted arguments or ``-mcpu=native`` to target the system's installed GPUs
 198 if present. Additionally, the AMDGPU target always uses ``-flto`` because we
 199 currently do not fully support ELF linking in ``lld``. Once built, we use the
 200 ``amdhsa-loader`` utility to launch execution on the GPU. This will be built if
 201 the ``hsa_runtime64`` library was found during build time.
 202
 203 .. code-block:: sh
 204
 205   $> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto -lc <install>/lib/amdgcn-amd-amdhsa/crt1.o
 206   $> amdhsa-loader --threads 2 --blocks 2 a.out
 207   Hello from AMDGPU!
 208   Hello from AMDGPU!
 209   Hello from AMDGPU!
 210   Hello from AMDGPU!
 211
 212 This will include the ``stdio.h`` header, which is found in the
 213 ``include/amdgcn-amd-amdhsa`` directory. We define our ``main`` function like a
 214 standard application. The startup utility in ``lib/amdgcn-amd-amdhsa/crt1.o``
 215 will handle the necessary steps to execute the ``main`` function along with
 216 global initializers and command line arguments. Finally, we link in the
 217 ``libc.a`` library stored in ``lib/amdgcn-amd-amdhsa`` to define the standard C
 218 functions.
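
Because the startup object forwards command line arguments to ``main``,
ordinary argument handling should work unchanged. The following is a minimal
sketch, assuming ``printf`` with the ``%d`` and ``%s`` conversions is available
on your target. The helper name ``print_arguments`` is hypothetical, and it is
meant to be called from ``main`` with its ``argc`` and ``argv`` parameters.

.. code-block:: c

  #include <stdio.h>

  /* Hypothetical helper: prints each argument and returns how many were seen. */
  int print_arguments(int argc, char **argv) {
    int count = 0;
    for (int i = 0; i < argc; ++i) {
      printf("argv[%d] = %s\n", i, argv[i]);
      ++count;
    }
    return count;
  }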
 219
 220 The search paths for the include directories and libraries are automatically
 221 handled by the compiler. We use this support internally to run unit tests on the
 222 GPU directly. See :ref:`libc_gpu_testing` for more information. The installation
 223 also provides ``libc.bc`` which is a single LLVM-IR bitcode blob that can be
 224 used instead of the static library.
 225
 226 Building for NVPTX targets
 227 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 228
 229 The infrastructure is the same as the AMDGPU example. However, the NVPTX binary
 230 utilities are very limited and must be targeted directly. A utility called
 231 ``clang-nvlink-wrapper`` instead wraps around the standard link job to give the
 232 illusion that ``nvlink`` is a functional linker.
 233
 234 .. code-block:: c++
 235
 236   #include <stdio.h>
 237
 238   int main(int argc, char **argv, char **envp) {
 239     printf("Hello from NVPTX!\n");
 240   }
 241
 242 Additionally, the NVPTX ABI requires that every function signature matches. This
 243 requires us to declare ``main`` with its full prototype. Using link time
 244 optimization will help hide this. The installation will contain the
 245 ``nvptx-loader`` utility if the CUDA driver was found during compilation.
 246
 247 .. code-block:: sh
 248
 249   $> clang hello.c --target=nvptx64-nvidia-cuda -mcpu=native -flto -lc <install>/lib/nvptx64-nvidia-cuda/crt1.o
 250   $> nvptx-loader --threads 2 --blocks 2 a.out
 251   Hello from NVPTX!
 252   Hello from NVPTX!
 253   Hello from NVPTX!
 254   Hello from NVPTX!