Note: this dialect is more likely to change than others in the near future; use
with caution.

This dialect provides middle-level abstractions for launching GPU kernels
following a programming model similar to that of CUDA or OpenCL. It provides
abstractions for kernel invocations (and may eventually provide those for device
management) that are not present at the lower level (e.g., as LLVM IR intrinsics
for GPUs). Its goal is to abstract away device- and driver-specific
manipulations to launch a GPU kernel and provide a simple path towards GPU
execution from MLIR. It may be targeted, for example, by DSLs using MLIR. The
dialect uses `gpu` as its canonical prefix.
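
For instance, a host-side launch using the dialect's `gpu.launch` op could look
like the sketch below (the `@main` wrapper, the constants, and the empty kernel
body are placeholders):

```
func.func @main() {
  %c1  = arith.constant 1 : index
  %c32 = arith.constant 32 : index
  // Launch a 1x1x1 grid of 32x1x1 blocks; the region is the kernel body.
  gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
             threads(%tx, %ty, %tz) in (%sx = %c32, %sy = %c1, %sz = %c1) {
    // Block and thread identifiers (%bx, ..., %tz) are usable here.
    gpu.terminator
  }
  return
}
```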

## Memory attribution

Memory buffers are defined at the function level, either in "gpu.launch" or in
"gpu.func" ops. This encoding makes it clear where the memory belongs and makes
the lifetime of the memory visible. The memory is only accessible while the
kernel is launched/the function is currently invoked. The latter is stricter
than actual GPU implementations, but using static memory at the function level
is just for convenience. It is also always possible to pass pointers to the
workgroup memory into other functions, provided they expect the correct memory
space.
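
For example, a kernel with one workgroup and one private buffer attribution
might be written as in the sketch below (the buffer shapes and the numeric
memory-space annotations are illustrative):

```
gpu.module @kernels {
  gpu.func @kernel(%arg0 : memref<?xf32>)
      workgroup(%workgroup : memref<32xf32, 3>)
      private(%private : memref<1xf32, 5>)
      kernel {
    // %workgroup and %private are only live while this function executes.
    gpu.return
  }
}
```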

The buffers are considered live throughout the execution of the GPU function
body. The absence of memory attribution syntax means that the function does not
require special buffers. Rationale: although the underlying models declare
memory buffers at the module level, we chose to do it at the function level to
provide some structuring for the lifetime of those buffers; this avoids the
incentive to use the buffers for communicating between different kernels or
launches of the same kernel, which should be done through function arguments
instead; we chose not to use an `alloca`-style approach that would require more
complex lifetime analysis, following the principles of MLIR that promote
structure and representing analysis results in the IR.

## GPU Compilation

### Deprecation notice

The `--gpu-to-(cubin|hsaco)` passes will be deprecated in a future release.

### Compilation overview

The compilation process in the GPU dialect has two main stages: GPU module
serialization and offloading operations translation. Together these stages can
produce GPU binaries and the necessary code to execute them.

An example of the compilation workflow looks like:

```
mlir-opt example.mlir                             \
  --pass-pipeline="builtin.module(                \
    nvvm-attach-target{chip=sm_90 O=3},           \ # Attach an NVVM target to a gpu.module op.
    gpu.module(convert-gpu-to-nvvm),              \ # Convert GPU to NVVM.
    gpu-to-llvm,                                  \ # Convert GPU to LLVM.
    gpu-module-to-binary                          \ # Serialize GPU modules to binaries.
  )" -o example-nvvm.mlir
mlir-translate example-nvvm.mlir                  \
  --mlir-to-llvmir                                \ # Obtain the translated LLVM IR.
  -o example.ll
```

### Module serialization

Attributes implementing the GPU Target Attribute Interface handle the
serialization process and are called Target attributes. These attributes can be
attached to GPU modules to indicate the serialization scheme used to compile
the module into a binary string.

The `gpu-module-to-binary` pass searches for all nested GPU modules and
serializes them using the target attributes attached to each module, producing
a binary with an object for every target.

Example:

```
gpu.module @kernels [#nvvm.target<chip = "sm_90">, #nvvm.target<chip = "sm_60">] {
  ...
}
// mlir-opt --gpu-module-to-binary:
gpu.binary @kernels [
  #gpu.object<#nvvm.target<chip = "sm_90">, "sm_90 cubin">,
  #gpu.object<#nvvm.target<chip = "sm_60">, "sm_60 cubin">
]
```

### Offloading LLVM translation

Attributes implementing the GPU Offloading LLVM Translation Attribute Interface
handle the translation of GPU binaries and kernel launches into LLVM
instructions and are called Offloading attributes. These attributes are
attached to GPU binary operations.

During the LLVM translation process, GPU binaries are translated into LLVM
instructions using the scheme provided by the Offloading attribute. Meanwhile,
kernel launches are translated by searching for the appropriate binary and
invoking the procedure provided by its Offloading attribute for translating
kernel launches into LLVM instructions.

Example:

```
// Binary with multiple objects but selecting the second one for embedding.
gpu.binary @binary <#gpu.select_object<#rocdl.target<chip = "gfx90a">>> [
    #gpu.object<#nvvm.target, "NVPTX">,
    #gpu.object<#rocdl.target<chip = "gfx90a">, "AMDGPU">
  ]

// Launching a kernel inside the binary.
gpu.launch_func @binary::@func blocks in (%0, %0, %0)
                               threads in (%0, %0, %0) : i64
                               dynamic_shared_memory_size %2
                               args(%1 : i32, %1 : i32)

// mlir-translate --mlir-to-llvmir:
@binary_bin_cst = internal constant [6 x i8] c"AMDGPU", align 8
@binary_func_kernel_name = private unnamed_addr constant [5 x i8] c"func\00", align 1
...
%module = call ptr @mgpuModuleLoad(ptr @binary_bin_cst)
%kernel = call ptr @mgpuModuleGetFunction(ptr %module, ptr @binary_func_kernel_name)
call void @mgpuLaunchKernel(ptr %kernel, ...) ; Launch the kernel
...
call void @mgpuModuleUnload(ptr %module)
```

### The binary operation

From a semantic point of view, GPU binaries allow the implementation of many
concepts, from simple object files to fat binaries. By default, the binary
operation uses the `#gpu.select_object` offloading attribute; this attribute
embeds a single object in the binary as a global string. See the attribute docs
for more information.
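
As a sketch, assuming the default attribute selects the first object when no
offloading attribute is given explicitly, the two forms below would be
equivalent (the object payload string is a placeholder):

```
// Implicitly uses #gpu.select_object, embedding the only object.
gpu.binary @kernels [#gpu.object<#nvvm.target, "cubin">]

// Equivalent explicit form, selecting the object at index 0.
gpu.binary @kernels <#gpu.select_object<0>> [#gpu.object<#nvvm.target, "cubin">]
```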
140 [include "Dialects/GPUOps.md"]