mlir/docs/Tutorials/transform/Ch0.md

   1 # Chapter 0: A Primer on “Structured” Linalg Operations
   2
   3 Before starting the tutorial on the Transform dialect, let us take a brief look at the concept of Structured operations and its implementation in the Linalg dialect. Note that the Transform dialect does not require Structured operations and vice versa. The two co-evolved at the beginning of the Transform dialect, which makes the subset of transformations for Structured operations the most mature and most suitable for the tutorial. If you are already familiar with this concept, skip to Chapter 1.
   4
   5 Structured code generation intends to preserve the structure of the computation for as long as necessary to enable transformations, up to and including the design of IR abstractions that support specific transformations.
   6
   7 ## Uniform Elementwise Extension
   8
   9 Consider a simple scalar arithmetic addition operation in MLIR, which maps directly to a machine instruction on most architectures that support floating point operations:
  10
  11
  12 ```mlir
  13 %2 = arith.addf %0, %1 : f32
  14 ```
  15
  16 This operation can be easily extended to uniformly apply to elements of a 1D vector, which is also often available as an instruction of vector machines:
  17
  18 ```mlir
  19 %2 = arith.addf %0, %1 : vector<8xf32>
  20 ```
  21
  22 Only a few modern instruction sets offer instructions for two- or more-dimensional vectors. In MLIR, however, it is possible to transparently extend the uniform elementwise application to vectors of arbitrary rank.
  23
  24 ```mlir
  25 %2 = arith.addf %0, %1 : vector<8x4xf32>
  26 %5 = arith.addf %3, %4 : vector<2x2x2x2x2x2x2xf32>
  27 ```
  28
  29 As you can notice, MLIR’s arithmetic operations on vectors preserve the structure of uniform elementwise application. This structure can be leveraged by the compiler, for example, to produce smaller-rank operations available on the target or to fuse multiplication and addition when such a fused instruction is available (which becomes complicated when there are a hundred of multiplications followed by a hundred of additions).
  30
  31 ## Reduction
  32
  33 Sometimes it is necessary to add elements of a vector to obtain a scalar. Some platforms provide specific instructions for this operation, some others provide ones that can be combined to achieve the desired effect, such as addition of adjacent elements and element shuffle.
  34
  35 The Vector dialect in MLIR defines an operation to explicitly denote a within-vector reduction:
  36
  37 ```mlir
  38 %0 = vector.reduction <add>, %0 : vector<8xf32> into f32
  39 ```
  40
  41 When no support is available, such an operation can be transformed into a loop:
  42
  43 ```mlir
  44 %c0 = arith.constant 0 : index
  45 %c1 = arith.constant 1 : index
  46 %c8 = arith.constant 8 : index
  47 %init = arith.constant 0.0 : f32
  48 %result = scf.for %i = %c0 to %c8 step %c1 iter_args(%partial = %init) -> (f32) {
  49   %element = vector.extractelement %0[%i : index] : vector<8xf32>
  50   %updated = arith.addf %partial, %element : f32
  51   scf.yield %updated : f32
  52 }
  53 ```
  54
  55 Even when special instructions are available, it may still be desirable to use the loop form (with unrolling), depending on instruction latency and register pressure. Preserving the structure of the operation as a single reduction gives the compiler an understanding that a within-vector reduction is performed and, therefore, a choice in implementation.
  56
  57 ## Contraction
  58
  59 Contraction is a generalization of reduction that multiplies elements from two vectors before adding them up. A simple “add” reduction can be thought of as a contraction where one of the vectors contains `1.0`, the neutral element of multiplication. Contractions offer even more flexibility to the compiler, and are represented by a dedicated operation in MLIR:
  60
  61 ```mlir
  62 // Neutral initializer for the addition.
  63 %init  = arith.constant 0.0 : f32
  64 // Neutral element of multiplication.
  65 %ones = arith.constant dense<1.0> : vector<8xf32>
  66 // Actual contraction.
  67 %result = vector.contract {
  68   indexing_maps = [affine_map<(i) -> (i)>,
  69                    affine_map<(i) -> (i)>,
  70                    affine_map<(i) -> ()>],
  71   iterator_types = ["reduction"]
  72 } %0, %ones, %init : vector<8xf32>, vector<8xf32> into f32
  73 ```
  74
  75 Note the `affine_map` expressions indicating how vector elements are indexed. Their meaning is perhaps most evident when writing the loop form in pseudo-code equivalent to this contraction:
  76
  77 ```mlir
  78 for i in 0 to 8:
  79   init += p0[i] * ones[i]
  80 ```
  81
  82 where both `%0` and `%ones` use the loop induction variable `i`, as noted on the right-hand side of the corresponding affine map, `(i) -> (i)`, and `%init` does not, as reflected on the right-hand side of its affine map, `(i) -> ()`.
  83
  84 Similarly to uniform elementwise extension, MLIR vector contractions are not limited to 1D cases. In the 2D+ case, one can additionally specify which of the vector dimensions are being reduced and which ones are being preserved. This can be achieved by using the `iterator_types` attribute that specifies, for each dimension, whether it is being reduced (`"reduction"`) or preserved (`"parallel"`). Consider the following 3D contraction that encodes a matrix-matrix multiplication:
  85
  86 ```mlir
  87 %result = vector.contract {
  88   indexing_maps = [affine_map<(i, j, k) -> (i, k)>,
  89                    affine_map<(i, j, k) -> (k, j)>,
  90                    affine_map<(i, j, k) -> (i, j)>],
  91   iterator_types = ["parallel", "parallel", "reduction"]
  92 } %lhs, %rhs, %init: vector<8x10xf32>, vector<10x16xf32> into vector<8x16xf32>
  93 ```
  94
  95 Looking at the indexing maps, it is easy to recognize the loop form:
  96
  97 ```mlir
  98 for i in 0 to 8:
  99   for j in 0 to 16:
 100     for k in 0 to 10:
 101       init[i, j] += lhs[i, k] * rhs[k, j]
 102 ```
 103
 104 Preserving this higher-level structure of a contraction makes it significantly easier for the compiler to recognize operations such as matrix multiplications and dot products and gives it freedom to produce lower-level operations that leverage most advanced instructions or even pre-generated microkernels.
 105
 106 ## Generic Operation on Memory
 107
 108 Until now, we have been considering operations on vectors stored in virtual registers. A similar contraction abstraction can be defined in memory:
 109
 110 ```mlir
 111 linalg.generic {
 112   indexing_maps = [affine_map<(i, j, k) -> (i, k)>,
 113                    affine_map<(i, j, k) -> (k, j)>,
 114                    affine_map<(i, j, k) -> (i, j)>],
 115   iterator_types = ["parallel", "parallel", "reduction"]
 116 } ins(%lhs, %rhs : memref<8x10xf32>, memref<10x16xf32>)
 117   outs(%init : memref<8x16xf32>) {
 118 ^bb0(%lhs_one: f32, %rhs_one: f32, %init_one: f32):
 119   %0 = arith.mulf %lhs_one, %rhs_one : f32
 120   %1 = arith.addf %init_one, %0 : f32
 121   linalg.yield %1 : f32
 122 }
 123 ```
 124
 125 This looks more complicated, so let us unpack. The `indexing_maps` and `iterator_types` are _exactly_ the same as we have seen above for vector contractions. The operands are now split into two lists:
 126
 127
 128 *   `in` operands containing the buffers that are being only read by the operation;
 129 *   `out` operands that are being read and updated by the operation.
 130
 131 This separation wasn’t necessary on vectors because, in MLIR, vectors are read-only (SSA or functional form) and operations mutating a vector are in fact producing a new one instead.
 132
 133 Furthermore, the operation now contains a region that explicitly specifies the multiplication and the addition operations that were implicit in the contraction. Block arguments in the region correspond to individual elements read from the buffer: the first two correspond to the `in` operands and the last one corresponds to the `out` operand. The value yielded from the region is “written” to the `out` operand and is available as the last block argument for future executions of the region. Note that the order in which the region is executed for various tuples of elements read from the buffers is not specified, and the write to the `out` buffer is written as a whole at the end of the operation.
 134
 135 ## “Loop” Fusion
 136
 137 Since the region of the `linalg.generic` operation can contain arbitrarily many operations, we can use it to express “fusion” of the implicit loops by simply having more operations chained in the region. For example, the common machine learning rectified linear unit layer (ReLU), which can be defined as `relu(x) = max(0, x)`, can be defined be expressed using the “compare-and-select” idiom in one `linalg.generic` operation, without the temporary buffer for the comparison result and without repeating the outer operation:
 138
 139 ```mlir
 140 linalg.generic {
 141   indexing_maps [affine_map<(i) -> (i)>, affine_map<(i) -> (i)>],
 142   iterator_types = ["parallel"]
 143 } ins(%in : memref<?xf32>) outs(%out : memref<?xf32>) {
 144 ^bb0(%in_one : f32, %out_one : f32):
 145   %c0 = arith.constant 0.0 : f32
 146   %0 = arith.cmpf ogt %in_one, %c0 : f32
 147   %1 = arith.select %0, %in_one, %c0 : f32
 148   linalg.yield %1 : f32
 149 }
 150 ```
 151
 152 Such operations can be converted to loops or lowered into vector forms after splitting into multiple operations, each of which maps to a Vector dialect primitive. This modeling, again, gives the compiler more choice in selecting the code generation strategy.
 153
 154 ## Generic Operation on Tensors
 155
 156 Let us take one last step up on the abstraction ladder. MLIR provides a tensor abstraction that makes it easy for the compiler to reason about multidimensional yet regular data without having to solve complex problems such as alias analysis and dependency satisfaction, which would be necessary on multidimensional buffers. The tensor abstraction is very similar to the vector abstraction (major differences include the availability of unranked tensors, tensor layouts, and vectors being usable as elemental types of tensors but not of other vectors). Tensors are read-only, and operations updating a tensor produce a new tensor.
 157
 158 The `linalg.generic` operation from above can lifted to operate on tensors instead of buffers:
 159
 160 ```mlir
 161 %result = linalg.generic {
 162   indexing_maps = [affine_map<(i, j, k) -> (i, k)>,
 163                    affine_map<(i, j, k) -> (k, j)>,
 164                    affine_map<(i, j, k) -> (i, j)>],
 165   iterator_types = ["parallel", "parallel", "reduction"]
 166 } ins(%lhs, %rhs : tensor<8x10xf32>,tensor<10x16xf32>)
 167   outs(%init :tensor<8x16xf32>) {
 168 ^bb0(%lhs_one: f32, %rhs_one: f32, %init_one: f32):
 169   %0 = arith.mulf %lhs_one, %rhs_one : f32
 170   %1 = arith.addf %init_one, %0 : f32
 171   linalg.yield %1 : f32
 172 } -> tensor<8x16xf32>
 173 ```
 174
 175 As you can notice, most components of this operation remain identical to its buffer version. It has been specifically designed this way. The main difference, beside the operand types, is that the operation now produces a new result instead of updating the `out` buffer. The `out` operand is used only as the initialization value.
 176
 177 If the `linalg.generic` operation had existed on vectors, it would have had the exact same structure.
 178
 179 ## Tiling and Loop Materialization
 180
 181 At this level of abstraction, it becomes easy for the compiler to perform more advanced transformations usually required for high-performance code generation, such as [tiling](https://en.wikipedia.org/wiki/Loop_nest_optimization). Tiling, in general, can be seen as partitioning the iteration space into smaller parts, or tiles, so that the data required by each part fits into a level of cache for example. The order in which tiles are executed must preserve the original data dependencies.
 182
 183 In the case of `linalg.generic` operations, the iteration space is implicit and is defined by the shape of the operands. Therefore, a tile can be expressed by performing the _same_ operation on a subset (slice) of the original data. Since the order in which the body of `linalg.generic` is applied to different tuples of the input elements is unspecified, tiles can be executed in any order, without the need for dependence analysis. In order to control the execution of different tiles, the implementation of tiling produces loops. Thus tiling `linalg.generic` operations can also be seen as materializing the loops that have been implicit until now.
 184
 185 For example, tiling the matrix multiplication presented above with tile sizes `(2, 8)`, we obtain a loop nest around a `linalg.generic` expressing the same operation on a `2x8` tensor.
 186
 187 ```mlir
 188 // A special "multi-for" loop that supports tensor-insertion semantics
 189 // as opposed to implicit updates. The resulting 8x16 tensor will be produced
 190 // by this loop.
 191 // The trip count of iterators is computed dividing the original tensor size,
 192 // 8x16, by the tile size, 2x8, to obtain 4x2.
 193 // When tensor sizes are dynamic, the trip count computation is emitted as IR
 194 // and is being computed at runtime.
 195 %0 = scf.forall (%i, %j) in (4, 2)
 196      shared_outs(%shared = %init) -> (tensor<8x16xf32>) {
 197
 198   // Scale the loop induction variables by the tile sizes.
 199   %3 = affine.apply affine_map<(d0) -> (d0 * 2)>(%i)
 200   %4 = affine.apply affine_map<(d0) -> (d0 * 8)>(%j)
 201
 202   // Take slices of inputs and outputs. Only the "i" and "j" dimensions are sliced.
 203   %lhs_slice = tensor.extract_slice %lhs[%3, 0] [2, 10] [1, 1]
 204              : tensor<8x10xf32> to tensor<2x10xf32>
 205   %rhs_slice = tensor.extract_slice %rhs[0, %4] [10, 8] [1, 1]
 206              : tensor<10x16xf32> to tensor<10x8xf32>
 207   %result_slice = tensor.extract_slice %shared[%3, %4] [2, 8] [1, 1]
 208                 : tensor<8x16xf32> to tensor<2x8xf32>
 209
 210   // This is exactly the same operation as before, but now operating on smaller
 211   // slices of data.
 212   %partial =  linalg.generic {
 213   indexing_maps = [affine_map<(i, j, k) -> (i, k)>,
 214                    affine_map<(i, j, k) -> (k, j)>,
 215                    affine_map<(i, j, k) -> (i, j)>],
 216   iterator_types = ["parallel", "parallel", "reduction"]
 217   } ins(%lhs_slice, %rhs_slice : tensor<2x10xf32>, tensor<10x8xf32>)
 218     outs(%result_slice : tensor<2x8xf32>) -> tensor<2x8xf32> {
 219   ^bb0(%lhs_one: f32, %rhs_one: f32, %init_one: f32):
 220     %0 = arith.mulf %lhs_one, %rhs_one : f32
 221     %1 = arith.addf %init_one, %0 : f32
 222     linalg.yield %1 : f32
 223   } : tensor<2x8xf32>
 224
 225   // Terminator for the loop with tensor-insertion semantics. Inserts a slice
 226   // into a larger tensor, potentially in parallel.
 227   scf.forall.in_parallel {
 228     tensor.parallel_insert_slice %partial into %shared[%3, %4] [2, 8] [1, 1]
 229         : tensor<2x8xf32> into tensor<8x16xf32>
 230   }
 231 }
 232 ```
 233
 234 ## Producer/Consumer Fusion and Rematerialization
 235
 236 After materializing loops with tiling, another key code generation transformation becomes simple – fusion. Unlike loop fusion, the Structured operations approach allows for producer/consumer fusion even when the (implicit) iteration spaces of the operations do not match. Given an high-level structured operation on tensors, such as `linalg.generic`, one can follow use-def chains to identify:
 237
 238 1. the subset (slice) of the operand that is used by the tile, and
 239 2. the tensor-level structured operation producing the whole tensor that is being sliced.
 240
 241 By inverting the `indexing_map` and applying it to the set of elements accessed through the slice, we can compute the part of the iteration space of the operation defining the full tensor necessary to compute the tile. Thus fusion boils down to replacing the `tensor.extract_slice` operation with the tile of the `linalg.generic` producing the original operand.
 242
 243 Let us assume that the matrix multiplication operation is followed by another operation that multiplies each element of the resulting matrix with itself. This trailing elementwise operation has a 2D iteration space, unlike the 3D one in matrix multiplication. Nevertheless, it is possible to tile the trailing operation and then fuse the producer of its operand, the matmul, into the loop generated by tiling. The untiled dimension will be used in its entirety.
 244
 245
 246 ```mlir
 247 // Same loop as before.
 248 %0 = scf.forall (%i, %j) in (4, 2)
 249      shared_outs(%shared = %init)
 250      -> (tensor<8x16xf32>, tensor<8x16xf32>) {
 251   // Scale the loop induction variables by the tile sizes.
 252   %1 = affine.apply affine_map<(d0) -> (d0 * 2)>(%i)
 253   %2 = affine.apply affine_map<(d0) -> (d0 * 8)>(%j)
 254
 255   // Take slices of inputs and outputs. Only the "i" and "j" dimensions are sliced.
 256   %lhs_slice = tensor.extract_slice %lhs[%1, 0] [2, 10] [1, 1]
 257              : tensor<8x10xf32> to tensor<2x10xf32>
 258   %rhs_slice = tensor.extract_slice %rhs[0, %2] [10, 8] [1, 1]
 259              : tensor<10x16xf32> to tensor<10x8xf32>
 260   %result_slice = tensor.extract_slice %result[%1, %2] [2, 8] [1, 1]
 261                 : tensor<8x16xf32> to tensor<2x8xf32>
 262
 263   // This is exactly the same matmul slice as before. It replaces the slice
 264   // extraction for the generic operation below.
 265   %partial = linalg.generic {
 266     indexing_maps = [affine_map<(i, j, k) -> (i, k)>,
 267                      affine_map<(i, j, k) -> (k, j)>,
 268                      affine_map<(i, j, k) -> (i, j)>],
 269     iterator_types = ["parallel", "parallel", "reduction"]
 270   } ins(%lhs_slice, %rhs_slice : tensor<2x10xf32>, tensor<10x8xf32>)
 271    outs(%result_slice : tensor<2x8xf32>) {
 272   ^bb0(%lhs_one: f32, %rhs_one: f32, %init_one: f32):
 273     %5 = arith.mulf %lhs_one, %rhs_one : f32
 274     %6 = arith.addf %init_one, %5 : f32
 275     linalg.yield %6 : f32
 276   } -> tensor<2x8xf32>
 277
 278   // Take the slice of the final result. Note that we don't need to take
 279   // the slice of the operand because the matmul operation above computes
 280   // it in-place.
 281   %shared_slice = tensor.extract_slice %shared[%1, %2] [2, 8] [1, 1]
 282                 : tensor<8x16xf32> to tensor<2x8xf32>
 283
 284   // The elementwise operation that we tiled.
 285   %elemwise = linalg.generic {
 286     indexing_maps = [affine_map<(i, j) -> (i, j)>,
 287                      affine_map<(i, j) -> (i, j)>],
 288     iterator_types = ["parallel", "parallel"]
 289   } ins(%partial : tensor<2x8xf32>)
 290    outs(%shared_slice : tensor<2x8xf32>) {
 291   ^bb0(%in: f32, %out: f32):
 292     %5 = arith.mulf %in, %in : f32
 293     linalg.yield %5 : f32
 294   } -> tensor<2x8xf32>
 295
 296   // Terminator for the loop with tensor-insertion semantics. Inserts a slice
 297   // into a larger tensor, potentially in parallel.
 298   scf.forall.in_parallel {
 299     tensor.parallel_insert_slice %elemwise into %shared[%1, %2] [2, 8] [1, 1]
 300         : tensor<2x8xf32> into tensor<8x16xf32>
 301   }
 302 }
 303 ```
 304
 305 This process may result in some elements in the operand tensors being (re)computed on every iteration of the loop. This is also known as _rematerialization_ and expresses the tradeoff between performing redundant computations or storing their result in (slow) memory.
 306
 307 ## Shorthand “Named” Forms of Linalg Ops
 308
 309 Linalg provides a set of predefined operations for common cases such as matrix multiplication, dot product, convolution, etc. These operations are equivalent to the `generic` ones but spare the need to spell out the access patterns and the bodies. For example, matrix multiplication is simply:
 310
 311 ```mlir
 312 %matmul = linalg.matmul ins(%lhs, %rhs: tensor<8x10xf32>, tensor<10x16xf32>)
 313                         outs(%init: tensor<8x10xf32xf32>) -> tensor<8x16xf32>
 314 ```