libc/docs/gpu/using.rst

   1 .. _libc_gpu_usage:
   2
   3 ===================
   4 Using libc for GPUs
   5 ===================
   6
   7 .. contents:: Table of Contents
   8   :depth: 4
   9   :local:
  10
  11 Using the GPU C library
  12 =======================
  13
  14 Once you have finished :ref:`building<libc_gpu_building>` the GPU C library it
  15 can be used to run libc or libm functions directly on the GPU. Currently, not
  16 all C standard functions are supported on the GPU. Consult the :ref:`list of
  17 supported functions<libc_gpu_support>` for details.
  18
  19 The GPU C library supports two main usage modes. The first is as a supplementary
  20 library for offloading languages such as OpenMP, CUDA, or HIP. This mode aims to
  21 provide standard system utilities similarly to existing vendor libraries. The
  22 second method treats the GPU as a hosted target by compiling C or C++ for it
  23 directly. This is more similar to targeting OpenCL and is primarily used for
  24 exported functions on the GPU and testing.
  25
  26 Offloading usage
  27 ----------------
  28
  29 Offloading languages like CUDA, HIP, or OpenMP work by compiling a single source
  30 file for both the host target and a list of offloading devices. In order to
  31 support standard compilation flows, the ``clang`` driver uses fat binaries,
  32 described in the `clang documentation
  33 <https://clang.llvm.org/docs/OffloadingDesign.html>`_. This linking mode is used
  34 by the OpenMP toolchain, but is currently opt-in for the CUDA and HIP toolchains
  35 through the ``--offload-new-driver`` and ``-fgpu-rdc`` flags.
  36
  37 The installation should contain a static library called ``libcgpu-amdgpu.a`` or
  38 ``libcgpu-nvptx.a`` depending on which GPU architectures your build targeted.
  39 These contain fat binaries compatible with the offloading toolchain such that
  40 they can be used directly.
  41
  42 .. code-block:: sh
  43
  44   $> clang openmp.c -fopenmp --offload-arch=gfx90a -lcgpu-amdgpu
  45   $> clang cuda.cu --offload-arch=sm_80 --offload-new-driver -fgpu-rdc -lcgpu-nvptx
  46   $> clang hip.hip --offload-arch=gfx940 --offload-new-driver -fgpu-rdc -lcgpu-amdgpu
  47
  48 This will automatically link in the needed function definitions if they were
  49 required by the user's application. Normally using the ``-fgpu-rdc`` option
  50 results in sub-par performance due to ABI linking. However, the offloading
  51 toolchain supports the ``-foffload-lto`` option to enable LTO on the target
  52 device.
  53
  54 Offloading languages require that functions present on the device be declared as
  55 such. This is done with the ``__device__`` keyword in CUDA and HIP or the
  56 ``declare target`` pragma in OpenMP. This requires that the LLVM C library
  57 exposes its implemented functions to the compiler when it is used to build. We
  58 support this by providing wrapper headers in the compiler's resource directory.
  59 These are located in ``<clang-resource-dir>/include/llvm-libc-wrappers`` in your
  60 installation.
  61
  62 The support for HIP and CUDA is more experimental, requiring manual intervention
  63 to link and use the facilities. An example of this is shown in the :ref:`CUDA
  64 server example<libc_gpu_cuda_server>`. The OpenMP offloading toolchain, however,
  65 is completely integrated with the LLVM C library. It will automatically
  66 handle including the necessary libraries, define device-side interfaces, and run
  67 the RPC server.
  68
  69 OpenMP Offloading example
  70 ^^^^^^^^^^^^^^^^^^^^^^^^^
  71
  72 This section provides a simple example of compiling an OpenMP program with the
  73 GPU C library.
  74
  75 .. code-block:: c++
  76
  77   #include <stdio.h>
  78
  79   int main() {
  80     FILE *file = stderr;
  81   #pragma omp target teams num_teams(2) thread_limit(2)
  82   #pragma omp parallel num_threads(2)
  83     { fputs("Hello from OpenMP!\n", file); }
  84   }
  85
  86 This can simply be compiled like any other OpenMP application to print from two
  87 threads and two blocks.
  88
  89 .. code-block:: sh
  90
  91   $> clang openmp.c -fopenmp --offload-arch=gfx90a
  92   $> ./a.out
  93   Hello from OpenMP!
  94   Hello from OpenMP!
  95   Hello from OpenMP!
  96   Hello from OpenMP!
  97
  98 Including the wrapper headers, linking the C library, and running the :ref:`RPC
  99 server<libc_gpu_rpc>` are all handled automatically by the compiler and runtime.
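
     User-defined functions that are called from a target region need the
     ``declare target`` pragma mentioned earlier. The following is a minimal sketch
     of that pattern, not taken from the library's own examples. The helper name
     ``count_vowels`` is hypothetical, and the sketch assumes ``strchr`` is among
     the functions supported on your GPU target.

     .. code-block:: c

       #include <stdio.h>
       #include <string.h>

       /* Hypothetical helper, marked so that a device version is generated. */
       #pragma omp declare target
       int count_vowels(const char *s) {
         int n = 0;
         for (; *s; ++s)
           if (strchr("aeiou", *s)) /* *s is never '\0' inside the loop */
             ++n;
         return n;
       }
       #pragma omp end declare target

       int main(void) {
         int n = 0;
         /* Run the helper inside a target region and copy the result back. */
       #pragma omp target map(from : n)
         {
           const char word[] = "offloading";
           n = count_vowels(word);
         }
         printf("%d\n", n);
         return 0;
       }

     When built without ``-fopenmp`` the pragmas are ignored and the program runs
     entirely on the host, which makes it easy to compare results. In both cases
     the program prints ``4``.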
 100
 101 Binary format
 102 ^^^^^^^^^^^^^
 103
 104 The ``libcgpu-*.a`` static archives are fat binaries containing LLVM-IR for each
 105 supported target device. The supported architectures can be seen using LLVM's
 106 ``llvm-objdump`` with the ``--offloading`` flag:
 107
 108 .. code-block:: sh
 109
 110   $> llvm-objdump --offloading libcgpu-amdgpu.a
 111   libcgpu-amdgpu.a(strcmp.cpp.o):    file format elf64-x86-64
 112
 113   OFFLOADING IMAGE [0]:
 114   kind            llvm ir
 115   arch            generic
 116   triple          amdgcn-amd-amdhsa
 117   producer        none
 118   ...
 119
 120 Because the device code is stored inside a fat binary, it can be difficult to
 121 inspect the resulting code. This can be done using the following utilities:
 122
 123 .. code-block:: sh
 124
 125   $> llvm-ar x libcgpu-amdgpu.a strcmp.cpp.o
 126   $> clang-offload-packager strcmp.cpp.o --image=arch=generic,file=strcmp.bc
 127   $> opt -S strcmp.bc
 128   ...
 129
 130 Please note that this fat binary format is provided for compatibility with
 131 existing offloading toolchains. The implementation in ``libc`` does not depend
 132 on any existing offloading languages and is completely freestanding.
 133
 134 Direct compilation
 135 ------------------
 136
 137 Instead of using standard offloading languages, we can also target the GPU
 138 directly using C and C++ to create a GPU executable similarly to OpenCL. This is
 139 done by targeting the GPU architecture using `clang's cross compilation
 140 support <https://clang.llvm.org/docs/CrossCompilation.html>`_. This is the
 141 method that the GPU C library uses both to build the library and to run tests.
 142
 143 This allows us to easily define GPU specific libraries and programs that fit
 144 well into existing tools. In order to target the GPU effectively we rely heavily
 145 on the compiler's intrinsic and built-in functions. For example, the following
 146 function gets the thread identifier in the 'x' dimension on both supported
 147 GPUs.
 148
 149 .. code-block:: c++
 150
       #include <stdint.h>

 151   uint32_t get_thread_id_x() {
 152   #if defined(__AMDGPU__)
 153     return __builtin_amdgcn_workitem_id_x();
 154   #elif defined(__NVPTX__)
 155     return __nvvm_read_ptx_sreg_tid_x();
 156   #else
 157   #error "Unsupported platform"
 158   #endif
 159   }
 160
 161 We can then compile this for both NVPTX and AMDGPU into LLVM-IR using the
 162 following commands. This will yield valid LLVM-IR for the given target just as
 163 if we were using CUDA, OpenCL, or OpenMP.
 164
 165 .. code-block:: sh
 166
 167   $> clang id.c --target=amdgcn-amd-amdhsa -mcpu=native -nogpulib -flto -c
 168   $> clang id.c --target=nvptx64-nvidia-cuda -march=native -nogpulib -flto -c
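
     The same pattern extends to combining several builtins. As a hedged sketch,
     the function below computes a flattened thread index in the 'x' dimension
     from the block index, block size, and thread index. The function name
     ``get_global_thread_id_x`` is illustrative, and the host fallback that
     returns zero is an assumption added only so the file also compiles for the
     host.

     .. code-block:: c

       #include <stdint.h>

       /* Flattened index: block index * block size + thread index in the block. */
       uint32_t get_global_thread_id_x(void) {
       #if defined(__AMDGPU__)
         return __builtin_amdgcn_workgroup_id_x() *
                    __builtin_amdgcn_workgroup_size_x() +
                __builtin_amdgcn_workitem_id_x();
       #elif defined(__NVPTX__)
         return __nvvm_read_ptx_sreg_ctaid_x() * __nvvm_read_ptx_sreg_ntid_x() +
                __nvvm_read_ptx_sreg_tid_x();
       #else
         /* Assumption for illustration only: a host build is a single thread. */
         return 0;
       #endif
       }

     Keeping the target checks inside one small function means the rest of a
     program can stay free of target-specific preprocessor branches.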
 169
 170 We can also use this support to treat the GPU as a hosted environment by
 171 providing a C library and startup object just like a standard C library running
 172 on the host machine. Then, in order to execute these programs, we provide a
 173 loader utility to launch the executable on the GPU similar to a cross-compiling
 174 emulator. This is how we run :ref:`unit tests <libc_gpu_testing>` targeting the
 175 GPU. This is clearly not the most efficient way to use a GPU, but it provides a
 176 simple method to test execution on a GPU for debugging or development.
 177
 178 Building for AMDGPU targets
 179 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 180
 181 The AMDGPU target supports several features natively by virtue of using ``lld``
 182 as its linker. The installation will include the ``include/amdgcn-amd-amdhsa``
 183 and ``lib/amdgcn-amd-amdhsa`` directories that contain the necessary code to use
 184 the library. We can directly link against ``libc.a`` and use LTO to generate the
 185 final executable.
 186
 187 .. code-block:: c++
 188
 189   #include <stdio.h>
 190
 191   int main() { fputs("Hello from AMDGPU!\n", stdout); }
 192
 193 This program can then be compiled using the ``clang`` compiler. Note that
 194 ``-flto`` and ``-mcpu=`` should be defined. This is because the GPU
 195 sub-architectures do not have strict backwards compatibility. Use ``-mcpu=help``
 196 for accepted arguments or ``-mcpu=native`` to target the system's installed GPUs
 197 if present. Additionally, the AMDGPU target always uses ``-flto`` because we
 198 currently do not fully support ELF linking in ``lld``. Once built, we use the
 199 ``amdhsa-loader`` utility to launch execution on the GPU. This will be built if
 200 the ``hsa_runtime64`` library was found during build time.
 201
 202 .. code-block:: sh
 203
 204   $> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto -lc <install>/lib/amdgcn-amd-amdhsa/crt1.o
 205   $> amdhsa-loader --threads 2 --blocks 2 a.out
 206   Hello from AMDGPU!
 207   Hello from AMDGPU!
 208   Hello from AMDGPU!
 209   Hello from AMDGPU!
 210
 211 This will include the ``stdio.h`` header, which is found in the
 212 ``include/amdgcn-amd-amdhsa`` directory. We define our ``main`` function like a
 213 standard application. The startup utility in ``lib/amdgcn-amd-amdhsa/crt1.o``
 214 will handle the necessary steps to execute the ``main`` function along with
 215 global initializers and command line arguments. Finally, we link in the
 216 ``libc.a`` library stored in ``lib/amdgcn-amd-amdhsa`` to define the standard C
 217 functions.
 218
 219 The search paths for the include directories and libraries are automatically
 220 handled by the compiler. We use this support internally to run unit tests on the
 221 GPU directly. See :ref:`libc_gpu_testing` for more information. The installation
 222 also provides ``libc.bc`` which is a single LLVM-IR bitcode blob that can be
 223 used instead of the static library.
 224
 225 Building for NVPTX targets
 226 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 227
 228 The infrastructure is the same as the AMDGPU example. However, the NVPTX binary
 229 utilities are very limited and must be targeted directly. There is no linker
 230 support for static libraries so we need to link in the ``libc.bc`` bitcode and
 231 inform the compiler driver of the file's contents.
 232
 233 .. code-block:: c++
 234
 235   #include <stdio.h>
 236
 237   int main(int argc, char **argv, char **envp) {
 238     fputs("Hello from NVPTX!\n", stdout);
 239   }
 240
 241 Additionally, the NVPTX ABI requires that every function signature matches. This
 242 requires us to declare ``main`` with its full prototype. The installation will
 243 contain the ``nvptx-loader`` utility if the CUDA driver was found during
 244 compilation.
 245
 246 .. code-block:: sh
 247
 248   $> clang hello.c --target=nvptx64-nvidia-cuda -march=native \
 249        -x ir <install>/lib/nvptx64-nvidia-cuda/libc.bc \
 250        -x ir <install>/lib/nvptx64-nvidia-cuda/crt1.o
 251   $> nvptx-loader --threads 2 --blocks 2 a.out
 252   Hello from NVPTX!
 253   Hello from NVPTX!
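
     Because the startup object passes command line arguments to ``main``, a
     program declared with the full prototype can read them like a hosted
     application. The following is a minimal sketch of that idea. The helper
     ``echo_args`` is hypothetical, and the sketch assumes the loader forwards
     any arguments it is given for the program, which should be confirmed against
     the loader's own help output.

     .. code-block:: c

       #include <stdio.h>

       /* Hypothetical helper: prints every argument after the program name and
          returns how many were printed. */
       static int echo_args(int argc, char **argv) {
         int printed = 0;
         for (int i = 1; i < argc; ++i) {
           fputs(argv[i], stdout);
           fputs("\n", stdout);
           ++printed;
         }
         return printed;
       }

       /* Full prototype so the signature matches on NVPTX; envp is unused. */
       int main(int argc, char **argv, char **envp) {
         (void)envp;
         echo_args(argc, argv);
         return 0;
       }

     The same source also builds for the host, so its behavior can be checked
     there first. On the GPU, every launched thread runs ``main``, so each
     argument is printed once per thread.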
 254   Hello from NVPTX!
 255   Hello from NVPTX!