mlir/docs/Bufferization.md

   1 # Bufferization
   2
   3 [TOC]
   4
   5 ## Overview
   6
   7 Bufferization in MLIR is the process of converting ops with `tensor` semantics
   8 to ops with `memref` semantics. MLIR provides an infrastructure that bufferizes
   9 an entire program in a single pass (*One-Shot Bufferize*). This infrastructure
  10 bufferizes all ops that implement the
  11 [`BufferizableOpInterface`](https://github.com/llvm/llvm-project/blob/17a68065c378da74805e4e1b9a5b78cc9f83e580/mlir/include/mlir/Dialect/Bufferization/IR/BufferizableOpInterface.td)
  12 can be bufferized.
  13
  14 MLIR has an older bufferization infrastructure built around
  15 [dialect conversion](DialectConversion.md). Most dialect conversion
  16 bufferization patterns have been migrated to One-Shot Bufferize, but some
  17 functionality such as function boundary bufferization still depends on dialect
  18 conversion and its type converter. New projects should use One-Shot Bufferize,
  19 as the dialect conversion-based bufferization will eventually be deprecated.
  20 Moreover, One-Shot Bufferize results in better bufferization with fewer memory
  21 allocations and buffer copies. This documentation is mostly about One-Shot
  22 Bufferize, but also describes how to gradually migrate a project from dialect
  23 conversion-based bufferization to One-Shot Bufferize.
  24
  25 ## What is One-Shot Bufferize?
  26
  27 One-Shot Bufferize is a new tensor bufferization pass designed for IR in
  28 [destination-passing style](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/dps-fhpc17.pdf),
  29 and with aggressive in-place bufferization.
  30
  31 One-Shot Bufferize is:
  32
  33 *   **Monolithic**: A single MLIR pass does the entire work, whereas the
  34     previous bufferization in MLIR was split across multiple passes residing in
  35     different dialects. In One-Shot Bufferize, `BufferizableOpInterface`
  36     implementations are spread across different dialects.
  37
  38 *   A **whole-function at a time analysis**. In-place bufferization decisions
  39     are made by analyzing SSA use-def chains on tensors. Op interface
  40     implementations not only provide the rewrite logic from tensor ops to memref
  41     ops, but also helper methods for One-Shot Bufferize's analysis to query
  42     information about an op's bufferization/memory semantics.
  43
  44 *   **Extensible** via an op interface: All ops that implement
  45     `BufferizableOpInterface` can be bufferized.
  46
  47 *   **2-Pass**: Bufferization is internally broken down into 2 steps: First,
  48     analyze the entire IR and make bufferization decisions. Then, bufferize
  49     (rewrite) the IR. The analysis has access to exact SSA use-def information.
  50     It incrementally builds alias and equivalence sets and does not rely on a
  51     posteriori-alias analysis from preallocated memory.
  52
  53 *   **Greedy**: Operations are analyzed one-by-one and it is decided on the spot
  54     whether a tensor OpOperand must be copied or not. Heuristics determine the
  55     order of analysis.
  56
  57 *   **Modular**: The current One-Shot Analysis can be replaced with a different
  58     analysis. The result of the analysis are queried by the bufferization via
  59     `AnalysisState`, in particular `AnalysisState::isInPlace`. Any derived class
  60     of `AnalysisState` that implements a small number virtual functions can
  61     serve as a custom analysis. It is even possible to run One-Shot Bufferize
  62     without any analysis (`AlwaysCopyAnalysisState`), in which case One-Shot
  63     Bufferize behaves exactly like the old dialect conversion-based
  64     bufferization (i.e., copy every buffer before writing to it).
  65
  66 To reduce complexity, One-Shot Bufferize should be
  67 [run after other transformations](https://llvm.discourse.group/t/rfc-linalg-on-tensors-update-and-comprehensive-bufferization-rfc/3373),
  68 typically as one of the last steps right before lowering memref ops. Many
  69 transformations are easier in tensor land; e.g., tile/fuse/… on tensors first,
  70 then bufferize the remaining IR.
  71
  72 From an architecture perspective, One-Shot Bufferize consists of
  73 [BufferizableOpInterface](https://github.com/llvm/llvm-project/blob/17a68065c378da74805e4e1b9a5b78cc9f83e580/mlir/include/mlir/Dialect/Bufferization/IR/BufferizableOpInterface.td)
  74 (and its implementations) and an
  75 [analysis](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h#L164)
  76 of tensor SSA values that decides if a buffer can be used directly or must be
  77 copied. The [bufferize] method of the op interface inspects analysis results and
  78 rewrites tensor ops into memref ops.
  79
  80 ## Goals of Bufferization
  81
  82 The high-level goal of every bufferization technique is to: 1. Use as little
  83 memory as possible. 2. Copy as little memory as possible.
  84
  85 This implies reusing already allocated buffers when possible, turning
  86 bufferization into an algorithmically complex problem with similarities to
  87 register allocation.
  88
  89 Depending on the concrete use case, there may be additional bufferization
  90 requirements. If the contents of a buffer are expensive to compute, there could
  91 be a tradeoff between *recomputation* and *compute once and copy*. On the
  92 contrary, it may not even be possible to allocate new buffers at runtime on some
  93 architectures.
  94
  95 ## Destination-Passing Style
  96
  97 Bufferization is an algorithmically complex problem. Given an op with a tensor
  98 result, bufferization has to choose a memref buffer in which the result can be
  99 stored. It is always safe to allocate a brand new buffer, but such a
 100 bufferization strategy would be unacceptable for high-performance codegen. When
 101 choosing an already existing buffer, we must be careful not to accidentally
 102 overwrite data that is still needed later in the program.
 103
 104 To simplify this problem, One-Shot Bufferize was designed to take advantage of
 105 *destination-passing style*. This form exists in itself independently of
 106 bufferization and is tied to SSA semantics: many ops are “updating” part of
 107 their input SSA variable. For example the LLVM instruction
 108 [`insertelement`](https://llvm.org/docs/LangRef.html#insertelement-instruction)
 109 is inserting an element inside a vector. Since SSA values are immutable, the
 110 operation returns a copy of the input vector with the element inserted.
 111 Another example in MLIR is `linalg.generic`, which always has an extra `outs`
 112 operand which provides the initial values to update (for example when the
 113 operation is doing a reduction).
 114
 115 This input is referred to as "destination" in the following (quotes are
 116 important as this operand isn't modified in place but copied) and comes into
 117 place in the context of bufferization as a possible "anchor" for the
 118 bufferization algorithm. This allows the user to shape the input in a form that
 119 guarantees close to optimal bufferization result when carefully choosing the
 120 SSA value used as "destination".
 121
 122 For every tensor result, a "destination-passing" style op has a corresponding
 123 tensor operand. If there aren't any other uses of this tensor, the bufferization
 124 can alias it with the op result and perform the operation "in-place" by reusing
 125 the buffer allocated for this "destination" input.
 126
 127 As an example, consider the following op: `%0 = tensor.insert %cst into
 128 %t[%idx] : tensor<?xf32>`
 129
 130 `%t` is the "destination" in this example. When choosing a buffer for the result
 131 `%0`, denoted as `buffer(%0)`, One-Shot Bufferize considers only two options:
 132
 133 1.  `buffer(%0) = buffer(%t)` : alias the "destination" tensor with the
 134     result and perform the operation in-place.
 135 2.  `buffer(%0)` is a newly allocated buffer.
 136
 137 There may be other buffers in the same function that could potentially be used
 138 for `buffer(%0)`, but those are not considered by One-Shot Bufferize to keep the
 139 bufferization simple. One-Shot Bufferize could be extended to consider such
 140 buffers in the future to achieve a better quality of bufferization.
 141
 142 Tensor ops that are not in destination-passing style always bufferized to a
 143 memory allocation. E.g.:
 144
 145 ```mlir
 146 %0 = tensor.generate %sz {
 147 ^bb0(%i : index):
 148   %cst = arith.constant 0.0 : f32
 149   tensor.yield %cst : f32
 150 } : tensor<?xf32>
 151 ```
 152
 153 The result of `tensor.generate` does not have a "destination" operand, so
 154 bufferization allocates a new buffer. This could be avoided by choosing an
 155 op such as `linalg.generic`, which can express the same computation with a
 156 "destination" operand, as specified behind outputs (`outs`):
 157
 158 ```mlir
 159 #map = affine_map<(i) -> (i)>
 160 %0 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel"]}
 161                     outs(%t : tensor<?xf32>) {
 162   ^bb0(%arg0 : f32):
 163     %cst = arith.constant 0.0 : f32
 164     linalg.yield %cst : f32
 165 } -> tensor<?xf32>
 166 ```
 167
 168 At first glance, the above `linalg.generic` op may not seem very useful because
 169 the output tensor `%t` is entirely overwritten. Why pass the tensor `%t` as an
 170 operand in the first place? As an example, this can be useful for overwriting a
 171 slice of a tensor:
 172
 173 ```mlir
 174 %t = tensor.extract_slice %s [%idx] [%sz] [1] : tensor<?xf32> to tensor<?xf32>
 175 %0 = linalg.generic ... outs(%t) { ... } -> tensor<?xf32>
 176 %1 = tensor.insert_slice %0 into %s [%idx] [%sz] [1]
 177     : tensor<?xf32> into tensor<?xf32>
 178 ```
 179
 180 The above example bufferizes to a `memref.subview`, followed by a
 181 "`linalg.generic` on memrefs" that overwrites the memory of the subview, assuming
 182 that the slice `%t` has no other user. The `tensor.insert_slice` then bufferizes
 183 to a no-op (in the absence of RaW conflicts such as a subsequent read of `%s`).
 184
 185 RaW conflicts are detected with an analysis of SSA use-def chains (details
 186 later). One-Shot Bufferize works best if there is a single SSA use-def chain,
 187 where the result of a tensor op is the operand of the next tensor ops, e.g.:
 188
 189 ```mlir
 190 %0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>)
 191 %1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
 192 %2 = "my_dialect.yet_another_op"(%1) : (tensor<?xf32>) -> (tensor<?xf32>)
 193 ```
 194
 195 Buffer copies are likely inserted if the SSA use-def chain splits at some point,
 196 e.g.:
 197
 198 ```mlir
 199 %0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>)
 200 %1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
 201 %2 = "my_dialect.yet_another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
 202 ```
 203
 204 One-Shot Bufferize has debug flags (`test-analysis-only print-conflicts`) that
 205 print the results of the analysis and explain to the user why buffer copies were
 206 inserted.
 207
 208 ## Using One-Shot Bufferize
 209
 210 MLIR provides a pass
 211 [`-one-shot-bufferize`](https://mlir.llvm.org/docs/Passes/#-one-shot-bufferize-one-shot-bufferize)
 212 that performs an analysis and bufferizes all ops with tensor semantics that
 213 implement `BufferizableOpInterface`. For modularity reasons, these op interface
 214 implementations are typically external models that live in a dialect's
 215 "Transforms" build unit. (External models are a mechanism for implementing an op
 216 interface in a different build unit.) It is the user's responsibility to ensure
 217 that all needed external models are registered before running One-Shot
 218 Bufferize.
 219
 220 By default, One-Shot Bufferize fails when it encounters an op with tensor
 221 semantics (i.e., tensor result or tensor operand) that is not bufferizable
 222 (i.e., does not implement `BufferizableOpInterface`). This can be avoided with
 223 `allow-unknown-ops`. In that case, One-Shot Bufferize inserts
 224 `to_memref`/`to_tensor` ops around the bufferization boundary. These ops are
 225 named versions of `unrealized_conversion_cast`. Note that One-Shot Bufferize's
 226 analysis can currently not analyze these ops, so input IR with such ops may fail
 227 bufferization. Therefore, running One-Shot Bufferize multiple times in a
 228 sequence is also not supported at the moment.
 229
 230 One-Shot Bufferize can be configured to bufferize only ops from a set of
 231 dialects with `dialect-filter`. This can be useful for gradually migrating from
 232 dialect conversion-based bufferization to One-Shot Bufferize. One-Shot Bufferize
 233 must run first in such a case, because dialect conversion-based bufferization
 234 generates `to_tensor`/`to_memref` ops which One-Shot Bufferize cannot analyze.
 235
 236 One-Shot Bufferize can also be called programmatically with
 237 [`bufferization::runOneShotBufferize`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h#L167).
 238 Alternatively,
 239 [`bufferization::bufferizeOp`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/Bufferize.h#L78)
 240 skips the analysis and inserts a copy on every buffer write, just like the
 241 dialect conversion-based bufferization.
 242
 243 ## Buffer Deallocation
 244
 245 **Important: this pass is deprecated, please use the ownership based buffer**
 246 **deallocation pass instead**
 247
 248 One-Shot Bufferize deallocates all buffers that it allocates. This is in
 249 contrast to the dialect conversion-based bufferization that delegates this job
 250 to the
 251 [`-buffer-deallocation`](https://mlir.llvm.org/docs/Passes/#-buffer-deallocation-adds-all-required-dealloc-operations-for-all-allocations-in-the-input-program)
 252 pass. By default, One-Shot Bufferize rejects IR where a newly allocated buffer
 253 is returned from a block. Such IR will fail bufferization.
 254
 255 A new buffer allocation is returned from a block when the result of an op that
 256 is not in destination-passing style is returned. E.g.:
 257
 258 ```mlir
 259 %0 = scf.if %c -> (tensor<?xf32>) {
 260   %1 = tensor.generate ... -> tensor<?xf32>
 261   scf.yield %1 : tensor<?xf32>
 262 } else {
 263   scf.yield %another_tensor : tensor<?xf32>
 264 }
 265 ```
 266
 267 The `scf.yield` in the "else" branch is OK, but the `scf.yield` in the "then"
 268 branch will be rejected.
 269
 270 Another case in which a buffer allocation may be returned is when a buffer copy
 271 must be inserted due to a RaW conflict. E.g.:
 272
 273 ```mlir
 274 %0 = scf.if %c -> (tensor<?xf32>) {
 275   %1 = tensor.insert %cst into %another_tensor[%idx] : tensor<?xf32>
 276   "my_dialect.reading_tensor_op"(%another_tensor) : (tensor<?xf32>) -> ()
 277   ...
 278   scf.yield %1 : tensor<?xf32>
 279 } else {
 280   scf.yield %yet_another_tensor : tensor<?xf32>
 281 }
 282 ```
 283
 284 In the above example, a buffer copy of `buffer(%another_tensor)` (with `%cst`
 285 inserted) is yielded from the "then" branch.
 286
 287 Note: Buffer allocations that are returned from a function are not deallocated.
 288 It is the caller's responsibility to deallocate the buffer. For the full
 289 function boundary ABI for MemRefs w.r.t. buffer deallocation refer to the
 290 [*Function Boundary ABI*](#function-boundary-abi) section. In the future, this
 291 could be automated with allocation hoisting (across function boundaries) or
 292 reference counting.
 293
 294 One-Shot Bufferize leaks all memory and does not generate any buffer
 295 deallocations. The `-buffer-deallocation-pipeline` has to be run afterwards to
 296 insert the deallocation operations.
 297
 298 ## Ownership-based Buffer Deallocation
 299
 300 Recommended compilation pipeline:
 301 ```
 302 one-shot-bufferize
 303        |          it's recommended to perform all bufferization here at latest,
 304        |       <- any allocations inserted after this point have to be handled
 305        V          manually
 306 expand-realloc
 307        V
 308 ownership-based-buffer-deallocation
 309        V
 310   canonicalize <- mostly for scf.if simplifications
 311        V
 312 buffer-deallocation-simplification
 313        V       <- from this point onwards no tensor values are allowed
 314 lower-deallocations
 315        V
 316       CSE
 317        V
 318   canonicalize
 319 ```
 320
 321 One-Shot Bufferize does not deallocate any buffers that it allocates. This job
 322 is delegated to the
 323 [`-ownership-based-buffer-deallocation`](https://mlir.llvm.org/docs/Passes/#-ownership-based-buffer-deallocation)
 324 pass, i.e., after running One-Shot Bufferize, the result IR may have a number of
 325 `memref.alloc` ops, but no `memref.dealloc` ops. This pass processes operations
 326 implementing `FunctionOpInterface` one-by-one without analysing the call-graph.
 327 This means, that there have to be [some rules](#function-boundary-abi) on how
 328 MemRefs are handled when being passed from one function to another. The rest of
 329 the pass revolves heavily around the `bufferization.dealloc` operation which is
 330 inserted at the end of each basic block with appropriate operands and should be
 331 optimized using the Buffer Deallocation Simplification pass
 332 (`--buffer-deallocation-simplification`) and the regular canonicalizer
 333 (`--canonicalize`). Lowering the result of the
 334 `-ownership-based-buffer-deallocation` pass directly using
 335 `--convert-bufferization-to-memref` without beforehand optimization is not
 336 recommended as it will lead to very inefficient code (the runtime-cost of
 337 `bufferization.dealloc` is `O(|memrefs|^2+|memref|*|retained|)`).
 338
 339 ### Function boundary ABI
 340
 341 The Buffer Deallocation pass operates on the level of operations implementing
 342 the `FunctionOpInterface`. Such operations can take MemRefs as arguments, but
 343 also return them. To ensure compatibility among all functions (including
 344 external ones), some rules have to be enforced:
 345 *   When a MemRef is passed as a function argument, ownership is never acquired.
 346     It is always the caller's responsibility to deallocate such MemRefs.
 347 *   Returning a MemRef from a function always passes ownership to the caller,
 348     i.e., it is also the caller's responsibility to deallocate memrefs returned
 349     from a called function.
 350 *   A function must not return a MemRef with the same allocated base buffer as
 351     one of its arguments (in this case a copy has to be created). Note that in
 352     this context two subviews of the same buffer that don't overlap are also
 353     considered to alias.
 354
 355 For external functions (e.g., library functions written externally in C), the
 356 externally provided implementation has to adhere to these rules and they are
 357 just assumed by the buffer deallocation pass. Functions on which the
 358 deallocation pass is applied and the implementation is accessible are modified
 359 by the pass such that the ABI is respected (i.e., buffer copies are inserted as
 360 necessary).
 361
 362 ### Inserting `bufferization.dealloc` operations
 363
 364 `bufferization.dealloc` operations are unconditionally inserted at the end of
 365 each basic block (just before the terminator). The majority of the pass is about
 366 finding the correct operands for this operation. There are three variadic
 367 operand lists to be populated, the first contains all MemRef values that may
 368 need to be deallocated, the second list contains their associated ownership
 369 values (of `i1` type), and the third list contains MemRef values that are still
 370 needed at a later point and should thus not be deallocated. This operation
 371 allows us to deal with any kind of aliasing behavior: it lowers to runtime
 372 aliasing checks when not enough information can be collected statically. When
 373 enough aliasing information is statically available, operands or the entire op
 374 may fold away.
 375
 376 **Ownerships**
 377
 378 To do so, we use a concept of ownership indicators of memrefs which materialize
 379 as an `i1` value for any SSA value of `memref` type, indicating whether the
 380 basic block in which it was materialized has ownership of this MemRef. Ideally,
 381 this is a constant `true` or `false`, but might also be a non-constant SSA
 382 value. To keep track of those ownership values without immediately materializing
 383 them (which might require insertion of `bufferization.clone` operations or
 384 operations checking for aliasing at runtime at positions where we don't actually
 385 need a materialized value), we use the `Ownership` class. This class represents
 386 the ownership in three states forming a lattice on a partial order:
 387 ```
 388 forall X in SSA values. uninitialized < unique(X) < unknown
 389 forall X, Y in SSA values.
 390   unique(X) == unique(Y) iff X and Y always evaluate to the same value
 391   unique(X) != unique(Y) otherwise
 392 ```
 393 Intuitively, the states have the following meaning:
 394 *   Uninitialized: the ownership is not initialized yet, this is the default
 395     state; once an operation is finished processing the ownership of all
 396     operation results with MemRef type should not be uninitialized anymore.
 397 *   Unique: there is a specific SSA value that can be queried to check ownership
 398     without materializing any additional IR
 399 *   Unknown: no specific SSA value is available without materializing additional
 400     IR, typically this is because two ownerships in 'Unique' state would have to
 401     be merged manually (e.g., the result of an `arith.select` either has the
 402     ownership of the then or else case depending on the condition value,
 403     inserting another `arith.select` for the ownership values can perform the
 404     merge and provide a 'Unique' ownership for the result), however, in the
 405     general case this 'Unknown' state has to be assigned.
 406
 407 Implied by the above partial order, the pass combines two ownerships in the
 408 following way:
 409
 410 | Ownership 1   | Ownership 2   | Combined Ownership |
 411 |:--------------|:--------------|:-------------------|
 412 | uninitialized | uninitialized | uninitialized      |
 413 | unique(X)     | uninitialized | unique(X)          |
 414 | unique(X)     | unique(X)     | unique(X)          |
 415 | unique(X)     | unique(Y)     | unknown            |
 416 | unknown       | unique        | unknown            |
 417 | unknown       | uninitialized | unknown            |
 418 | <td colspan=3> + symmetric cases                   |
 419
 420 **Collecting the list of MemRefs that potentially need to be deallocated**
 421
 422 For a given block, the list of MemRefs that potentially need to be deallocated
 423 at the end of that block is computed by keeping track of all values for which
 424 the block potentially takes over ownership. This includes MemRefs provided as
 425 basic block arguments, interface handlers for operations like `memref.alloc` and
 426 `func.call`, but also liveness information in regions with multiple basic
 427 blocks.  More concretely, it is computed by taking the MemRefs in the 'in' set
 428 of the liveness analysis of the current basic block B, appended by the MemRef
 429 block arguments and by the set of MemRefs allocated in B itself (determined by
 430 the interface handlers), then subtracted (also determined by the interface
 431 handlers) by the set of MemRefs deallocated in B.
 432
 433 Note that we don't have to take the intersection of the liveness 'in' set with
 434 the 'out' set of the predecessor block because a value that is in the 'in' set
 435 must be defined in an ancestor block that dominates all direct predecessors and
 436 thus the 'in' set of this block is a subset of the 'out' sets of each
 437 predecessor.
 438
 439 ```
 440 memrefs = filter((liveIn(block) U
 441   allocated(block) U arguments(block)) \ deallocated(block), isMemRef)
 442 ```
 443
 444 The list of conditions for the second variadic operands list of
 445 `bufferization.dealloc` is computed by querying the stored ownership value for
 446 each of the MemRefs collected as described above. The ownership state is updated
 447 by the interface handlers while processing the basic block.
 448
 449 **Collecting the list of MemRefs to retain**
 450
 451 Given a basic block B, the list of MemRefs that have to be retained can be
 452 different for each successor block S.  For the two basic blocks B and S and the
 453 values passed via block arguments to the destination block S, we compute the
 454 list of MemRefs that have to be retained in B by taking the MemRefs in the
 455 successor operand list of the terminator and the MemRefs in the 'out' set of the
 456 liveness analysis for B intersected with the 'in' set of the destination block
 457 S.
 458
 459 This list of retained values makes sure that we cannot run into use-after-free
 460 situations even if no aliasing information is present at compile-time.
 461
 462 ```
 463 toRetain = filter(successorOperands + (liveOut(fromBlock) insersect
 464   liveIn(toBlock)), isMemRef)
 465 ```
 466
 467 ### Supported interfaces
 468
 469 The pass uses liveness analysis and a few interfaces:
 470 *   `FunctionOpInterface`
 471 *   `CallOpInterface`
 472 *   `MemoryEffectOpInterface`
 473 *   `RegionBranchOpInterface`
 474 *   `RegionBranchTerminatorOpInterface`
 475
 476 Due to insufficient information provided by the interface, it also special-cases
 477 on the `cf.cond_br` operation and makes some assumptions about operations
 478 implementing the `RegionBranchOpInterface` at the moment, but improving the
 479 interfaces would allow us to remove those dependencies in the future.
 480
 481 ### Limitations
 482
 483 The Buffer Deallocation pass has some requirements and limitations on the input
 484 IR. These are checked in the beginning of the pass and errors are emitted
 485 accordingly:
 486 *   The set of interfaces the pass operates on must be implemented (correctly).
 487     E.g., if there is an operation present with a nested region, but does not
 488     implement the `RegionBranchOpInterface`, an error is emitted because the
 489     pass cannot know the semantics of the nested region (and does not make any
 490     default assumptions on it).
 491 *   No explicit control-flow loops are present. Currently, only loops using
 492     structural-control-flow are supported.  However, this limitation could be
 493     lifted in the future.
 494 *   Deallocation operations should not be present already. The pass should
 495     handle them correctly already (at least in most cases), but it's not
 496     supported yet due to insufficient testing.
 497 *   Terminators must implement either `RegionBranchTerminatorOpInterface` or
 498     `BranchOpInterface`, but not both. Terminators with more than one successor
 499     are not supported (except `cf.cond_br`). This is not a fundamental
 500     limitation, but there is no use-case justifying the more complex
 501     implementation at the moment.
 502
 503 ### Example
 504
 505 The following example contains a few interesting cases:
 506 *   Basic block arguments are modified to also pass along the ownership
 507     indicator, but not for entry bocks of non-private functions (assuming the
 508     `private-function-dynamic-ownership` pass option is disabled) where the
 509     function boundary ABI is applied instead. "Private" in this context refers
 510     to functions that cannot be called externally.
 511 *   The result of `arith.select` initially has 'Unknown' assigned as ownership,
 512     but once the `bufferization.dealloc` operation is inserted it is put in the
 513     'retained' list (since it has uses in a later basic block) and thus the
 514     'Unknown' ownership can be replaced with a 'Unique' ownership using the
 515     corresponding result of the dealloc operation.
 516 *   The `cf.cond_br` operation has more than one successor and thus has to
 517     insert two `bufferization.dealloc` operations (one for each successor).
 518     While they have the same list of MemRefs to deallocate (because they perform
 519     the deallocations for the same block), it must be taken into account that
 520     some MemRefs remain *live* for one branch but not the other (thus set
 521     intersection is performed on the *live-out* of the current block and the
 522     *live-in* of the target block). Also, `cf.cond_br` supports separate
 523     forwarding operands for each successor. To make sure that no MemRef is
 524     deallocated twice (because there are two `bufferization.dealloc` operations
 525     with the same MemRefs to deallocate), the condition operands are adjusted to
 526     take the branch condition into account. While a generic lowering for such
 527     terminator operations could be implemented, a specialized implementation can
 528     take all the semantics of this particular operation into account and thus
 529     generate a more efficient lowering.
 530
 531 ```mlir
 532 func.func @example(%memref: memref<?xi8>, %select_cond: i1, %br_cond: i1) {
 533   %alloc = memref.alloc() : memref<?xi8>
 534   %alloca = memref.alloca() : memref<?xi8>
 535   %select = arith.select %select_cond, %alloc, %alloca : memref<?xi8>
 536   cf.cond_br %br_cond, ^bb1(%alloc : memref<?xi8>), ^bb1(%memref : memref<?xi8>)
 537 ^bb1(%bbarg: memref<?xi8>):
 538   test.copy(%bbarg, %select) : (memref<?xi8>, memref<?xi8>)
 539   return
 540 }
 541 ```
 542
 543 After running `--ownership-based-buffer-deallocation`, it looks as follows:
 544
 545 ```mlir
 546 // Since this is not a private function, the signature will not be modified even
 547 // when private-function-dynamic-ownership is enabled. Instead the function
 548 // boundary ABI has to be applied which means that ownership of `%memref` will
 549 // never be acquired.
 550 func.func @example(%memref: memref<?xi8>, %select_cond: i1, %br_cond: i1) {
 551   %false = arith.constant false
 552   %true = arith.constant true
 553
 554   // The ownership of a MemRef defined by the `memref.alloc` operation is always
 555   // assigned to be 'true'.
 556   %alloc = memref.alloc() : memref<?xi8>
 557
 558   // The ownership of a MemRef defined by the `memref.alloca` operation is
 559   // always assigned to be 'false'.
 560   %alloca = memref.alloca() : memref<?xi8>
 561
 562   // The ownership of %select will be the join of the ownership of %alloc and
 563   // the ownership of %alloca, i.e., of %true and %false. Because the pass does
 564   // not know about the semantics of the `arith.select` operation (unless a
 565   // custom handler is implemented), the ownership join will be 'Unknown'. If
 566   // the materialized ownership indicator of %select is needed, either a clone
 567   // has to be created for which %true is assigned as ownership or the result
 568   // of a `bufferization.dealloc` where %select is in the retain list has to be
 569   // used.
 570   %select = arith.select %select_cond, %alloc, %alloca : memref<?xi8>
 571
 572   // We use `memref.extract_strided_metadata` to get the base memref since it is
 573   // not allowed to pass arbitrary memrefs to `memref.dealloc`. This property is
 574   // already enforced for `bufferization.dealloc`
 575   %base_buffer_memref, ... = memref.extract_strided_metadata %memref
 576     : memref<?xi8> -> memref<i8>, index, index, index
 577   %base_buffer_alloc, ... = memref.extract_strided_metadata %alloc
 578     : memref<?xi8> -> memref<i8>, index, index, index
 579   %base_buffer_alloca, ... = memref.extract_strided_metadata %alloca
 580     : memref<?xi8> -> memref<i8>, index, index, index
 581
 582   // The deallocation conditions need to be adjusted to incorporate the branch
 583   // condition. In this example, this requires only a single negation, but might
 584   // also require multiple arith.andi operations.
 585   %not_br_cond = arith.xori %true, %br_cond : i1
 586
 587   // There are two dealloc operations inserted in this basic block, one per
 588   // successor. Both have the same list of MemRefs to deallocate and the
 589   // conditions only differ by the branch condition conjunct.
 590   // Note, however, that the retained list differs. Here, both contain the
 591   // %select value because it is used in both successors (since it's the same
 592   // block), but the value passed via block argument differs (%memref vs.
 593   // %alloc).
 594   %10:2 = bufferization.dealloc
 595            (%base_buffer_memref, %base_buffer_alloc, %base_buffer_alloca
 596              : memref<i8>, memref<i8>, memref<i8>)
 597         if (%false, %br_cond, %false)
 598     retain (%alloc, %select : memref<?xi8>, memref<?xi8>)
 599
 600   %11:2 = bufferization.dealloc
 601            (%base_buffer_memref, %base_buffer_alloc, %base_buffer_alloca
 602              : memref<i8>, memref<i8>, memref<i8>)
 603         if (%false, %not_br_cond, %false)
 604     retain (%memref, %select : memref<?xi8>, memref<?xi8>)
 605
 606   // Because %select is used in ^bb1 without passing it via block argument, we
 607   // need to update it's ownership value here by merging the ownership values
 608   // returned by the dealloc operations
 609   %new_ownership = arith.select %br_cond, %10#1, %11#1 : i1
 610
 611   // The terminator is modified to pass along the ownership indicator values
 612   // with each MemRef value.
 613   cf.cond_br %br_cond, ^bb1(%alloc, %10#0 : memref<?xi8>, i1),
 614                        ^bb1(%memref, %11#0 : memref<?xi8>, i1)
 615
 616 // All non-entry basic blocks are modified to have an additional i1 argument for
 617 // each MemRef value in the argument list.
 618 ^bb1(%13: memref<?xi8>, %14: i1):  // 2 preds: ^bb0, ^bb0
 619   test.copy(%13, %select) : (memref<?xi8>, memref<?xi8>)
 620
 621   %base_buffer_13, ... = memref.extract_strided_metadata %13
 622     : memref<?xi8> -> memref<i8>, index, index, index
 623   %base_buffer_select, ... = memref.extract_strided_metadata %select
 624     : memref<?xi8> -> memref<i8>, index, index, index
 625
 626   // Here, we don't have a retained list, because the block has no successors
 627   // and the return has no operands.
 628   bufferization.dealloc (%base_buffer_13, %base_buffer_select
 629                           : memref<i8>, memref<i8>)
 630                      if (%14, %new_ownership)
 631   return
 632 }
 633 ```
 634
 635 ## Buffer Deallocation Simplification Pass
 636
 637 The [semantics of the `bufferization.dealloc` operation](https://mlir.llvm.org/docs/Dialects/BufferizationOps/#bufferizationdealloc-bufferizationdeallocop)
 638 provide a lot of opportunities for optimizations which can be conveniently split
 639 into patterns using the greedy pattern rewriter. Some of those patterns need
 640 access to additional analyses such as an analysis that can determine whether two
 641 MemRef values must, may, or never originate from the same buffer allocation.
 642 These patterns are collected in the Buffer Deallocation Simplification pass,
 643 while patterns that don't need additional analyses are registered as part of the
 644 regular canonicalizer pass. This pass is best run after
 645 `--ownership-based-buffer-deallocation` followed by `--canonicalize`.
 646
 647 The pass applies patterns for the following simplifications:
 648 *   Remove MemRefs from retain list when guaranteed to not alias with any value
 649     in the 'memref' operand list. This avoids an additional aliasing check with
 650     the removed value.
 651 *   Split off values in the 'memref' list to new `bufferization.dealloc`
 652     operations only containing this value in the 'memref' list when it is
 653     guaranteed to not alias with any other value in the 'memref' list. This
 654     avoids at least one aliasing check at runtime and enables using a more
 655     efficient lowering for this new `bufferization.dealloc` operation.
 656 *   Remove values from the 'memref' operand list when it is guaranteed to alias
 657     with at least one value in the 'retained' list and may not alias any other
 658     value in the 'retain' list.
 659
 660 ## Lower Deallocations Pass
 661
 662 The `-lower-deallocations` pass transforms all `bufferization.dealloc`
 663 operations to `memref.dealloc` operations and may also insert operations from
 664 the `scf`, `func`, and `arith` dialects to make deallocations conditional and
 665 check whether two MemRef values come from the same allocation at runtime (when
 666 the `buffer-deallocation-simplification` pass wasn't able to determine it
 667 statically).
 668
 669 The same lowering of the `bufferization.dealloc` operation is also part of the
 670 `-convert-bufferization-to-memref` conversion pass which also lowers all the
 671 other operations of the bufferization dialect.
 672
 673 We distinguish multiple cases in this lowering pass to provide an overall more
 674 efficient lowering. In the general case, a library function is created to avoid
 675 quadratic code size explosion (relative to the number of operands of the dealloc
 676 operation). The specialized lowerings aim to avoid this library function because
 677 it requires allocating auxiliary MemRefs of index values.
 678
 679 ### Generic Lowering
 680
 681 A library function is generated to avoid code-size blow-up. On a high level, the
 682 base-memref of all operands is extracted as an index value and stored into
 683 specifically allocated MemRefs and passed to the library function which then
 684 determines whether they come from the same original allocation. This information
 685 is needed to avoid double-free situations and to correctly retain the MemRef
 686 values in the `retained` list.
 687
 688 **Dealloc Operation Lowering**
 689
 690 This lowering supports all features the dealloc operation has to offer. It
 691 computes the base pointer of each memref (as an index), stores it in a
 692 new memref helper structure and passes it to the helper function generated
 693 in `buildDeallocationLibraryFunction`. The results are stored in two lists
 694 (represented as MemRefs) of booleans passed as arguments. The first list
 695 stores whether the corresponding condition should be deallocated, the
 696 second list stores the ownership of the retained values which can be used
 697 to replace the result values of the `bufferization.dealloc` operation.
 698
 699 Example:
 700 ```
 701 %0:2 = bufferization.dealloc (%m0, %m1 : memref<2xf32>, memref<5xf32>)
 702                           if (%cond0, %cond1)
 703                       retain (%r0, %r1 : memref<1xf32>, memref<2xf32>)
 704 ```
 705 lowers to (simplified):
 706 ```
 707 %c0 = arith.constant 0 : index
 708 %c1 = arith.constant 1 : index
 709 %dealloc_base_pointer_list = memref.alloc() : memref<2xindex>
 710 %cond_list = memref.alloc() : memref<2xi1>
 711 %retain_base_pointer_list = memref.alloc() : memref<2xindex>
 712 %m0_base_pointer = memref.extract_aligned_pointer_as_index %m0
 713 memref.store %m0_base_pointer, %dealloc_base_pointer_list[%c0]
 714 %m1_base_pointer = memref.extract_aligned_pointer_as_index %m1
 715 memref.store %m1_base_pointer, %dealloc_base_pointer_list[%c1]
 716 memref.store %cond0, %cond_list[%c0]
 717 memref.store %cond1, %cond_list[%c1]
 718 %r0_base_pointer = memref.extract_aligned_pointer_as_index %r0
 719 memref.store %r0_base_pointer, %retain_base_pointer_list[%c0]
 720 %r1_base_pointer = memref.extract_aligned_pointer_as_index %r1
 721 memref.store %r1_base_pointer, %retain_base_pointer_list[%c1]
 722 %dyn_dealloc_base_pointer_list = memref.cast %dealloc_base_pointer_list :
 723    memref<2xindex> to memref<?xindex>
 724 %dyn_cond_list = memref.cast %cond_list : memref<2xi1> to memref<?xi1>
 725 %dyn_retain_base_pointer_list = memref.cast %retain_base_pointer_list :
 726    memref<2xindex> to memref<?xindex>
 727 %dealloc_cond_out = memref.alloc() : memref<2xi1>
 728 %ownership_out = memref.alloc() : memref<2xi1>
 729 %dyn_dealloc_cond_out = memref.cast %dealloc_cond_out :
 730    memref<2xi1> to memref<?xi1>
 731 %dyn_ownership_out = memref.cast %ownership_out :
 732    memref<2xi1> to memref<?xi1>
 733 call @dealloc_helper(%dyn_dealloc_base_pointer_list,
 734                      %dyn_retain_base_pointer_list,
 735                      %dyn_cond_list,
 736                      %dyn_dealloc_cond_out,
 737                      %dyn_ownership_out) : (...)
 738 %m0_dealloc_cond = memref.load %dyn_dealloc_cond_out[%c0] : memref<2xi1>
 739 scf.if %m0_dealloc_cond {
 740   memref.dealloc %m0 : memref<2xf32>
 741 }
 742 %m1_dealloc_cond = memref.load %dyn_dealloc_cond_out[%c1] : memref<2xi1>
 743 scf.if %m1_dealloc_cond {
 744   memref.dealloc %m1 : memref<5xf32>
 745 }
 746 %r0_ownership = memref.load %dyn_ownership_out[%c0] : memref<2xi1>
 747 %r1_ownership = memref.load %dyn_ownership_out[%c1] : memref<2xi1>
 748 memref.dealloc %dealloc_base_pointer_list : memref<2xindex>
 749 memref.dealloc %retain_base_pointer_list : memref<2xindex>
 750 memref.dealloc %cond_list : memref<2xi1>
 751 memref.dealloc %dealloc_cond_out : memref<2xi1>
 752 memref.dealloc %ownership_out : memref<2xi1>
 753 // replace %0#0 with %r0_ownership
 754 // replace %0#1 with %r1_ownership
 755 ```
 756
 757 **Library function**
 758
 759 A library function is built per compilation unit that can be called at
 760 bufferization dealloc sites to determine whether two MemRefs come from the same
 761 allocation and their new ownerships.
 762
 763 The generated function takes two MemRefs of indices and three MemRefs of
 764 booleans as arguments:
 765   * The first argument A should contain the result of the
 766   extract_aligned_pointer_as_index operation applied to the MemRefs to be
 767   deallocated
 768   * The second argument B should contain the result of the
 769   extract_aligned_pointer_as_index operation applied to the MemRefs to be
 770   retained
 771   * The third argument C should contain the conditions as passed directly
 772   to the deallocation operation.
 773   * The fourth argument D is used to pass results to the caller. Those
 774   represent the condition under which the MemRef at the corresponding
 775   position in A should be deallocated.
 776   * The fifth argument E is used to pass results to the caller. It
 777   provides the ownership value corresponding the the MemRef at the same
 778   position in B
 779
 780 This helper function is supposed to be called once for each
 781 `bufferization.dealloc` operation to determine the deallocation need and
 782 new ownership indicator for the retained values, but does not perform the
 783 deallocation itself.
 784
 785 Generated code:
 786 ```
 787 func.func @dealloc_helper(
 788     %dyn_dealloc_base_pointer_list: memref<?xindex>,
 789     %dyn_retain_base_pointer_list: memref<?xindex>,
 790     %dyn_cond_list: memref<?xi1>,
 791     %dyn_dealloc_cond_out: memref<?xi1>,
 792     %dyn_ownership_out: memref<?xi1>) {
 793   %c0 = arith.constant 0 : index
 794   %c1 = arith.constant 1 : index
 795   %true = arith.constant true
 796   %false = arith.constant false
 797   %num_dealloc_memrefs = memref.dim %dyn_dealloc_base_pointer_list, %c0
 798   %num_retain_memrefs = memref.dim %dyn_retain_base_pointer_list, %c0
 799   // Zero initialize result buffer.
 800   scf.for %i = %c0 to %num_retain_memrefs step %c1 {
 801     memref.store %false, %dyn_ownership_out[%i] : memref<?xi1>
 802   }
 803   scf.for %i = %c0 to %num_dealloc_memrefs step %c1 {
 804     %dealloc_bp = memref.load %dyn_dealloc_base_pointer_list[%i]
 805     %cond = memref.load %dyn_cond_list[%i]
 806     // Check for aliasing with retained memrefs.
 807     %does_not_alias_retained = scf.for %j = %c0 to %num_retain_memrefs
 808         step %c1 iter_args(%does_not_alias_aggregated = %true) -> (i1) {
 809       %retain_bp = memref.load %dyn_retain_base_pointer_list[%j]
 810       %does_alias = arith.cmpi eq, %retain_bp, %dealloc_bp : index
 811       scf.if %does_alias {
 812         %curr_ownership = memref.load %dyn_ownership_out[%j]
 813         %updated_ownership = arith.ori %curr_ownership, %cond : i1
 814         memref.store %updated_ownership, %dyn_ownership_out[%j]
 815       }
 816       %does_not_alias = arith.cmpi ne, %retain_bp, %dealloc_bp : index
 817       %updated_aggregate = arith.andi %does_not_alias_aggregated,
 818                                       %does_not_alias : i1
 819       scf.yield %updated_aggregate : i1
 820     }
 821     // Check for aliasing with dealloc memrefs in the list before the
 822     // current one, i.e.,
 823     // `fix i, forall j < i: check_aliasing(%dyn_dealloc_base_pointer[j],
 824     // %dyn_dealloc_base_pointer[i])`
 825     %does_not_alias_any = scf.for %j = %c0 to %i step %c1
 826        iter_args(%does_not_alias_agg = %does_not_alias_retained) -> (i1) {
 827       %prev_dealloc_bp = memref.load %dyn_dealloc_base_pointer_list[%j]
 828       %does_not_alias = arith.cmpi ne, %prev_dealloc_bp, %dealloc_bp
 829       %updated_alias_agg = arith.andi %does_not_alias_agg, %does_not_alias
 830       scf.yield %updated_alias_agg : i1
 831     }
 832     %dealloc_cond = arith.andi %does_not_alias_any, %cond : i1
 833     memref.store %dealloc_cond, %dyn_dealloc_cond_out[%i] : memref<?xi1>
 834   }
 835   return
 836 }
 837 ```
 838
 839 ### Specialized Lowerings
 840
 841 Currently, there are two special lowerings for common cases to avoid the library
 842 function and thus unnecessary memory load and store operations and function
 843 calls:
 844
 845 **One memref, no retained**
 846
 847 Lower a simple case without any retained values and a single MemRef. Ideally,
 848 static analysis can provide enough information such that the
 849 `buffer-deallocation-simplification` pass is able to split the dealloc
 850 operations up into this simple case as much as possible before running this
 851 pass.
 852
 853 Example:
 854 ```mlir
 855 bufferization.dealloc (%arg0 : memref<2xf32>) if (%arg1)
 856 ```
 857 is lowered to
 858 ```mlir
 859 scf.if %arg1 {
 860   memref.dealloc %arg0 : memref<2xf32>
 861 }
 862 ```
 863
 864 In most cases, the branch condition is either constant 'true' or 'false' and can
 865 thus be optimized away entirely by the canonicalizer pass.
 866
 867 **One memref, arbitrarily many retained**
 868
 869 A special case lowering for the deallocation operation with exactly one MemRef,
 870 but an arbitrary number of retained values. The size of the code produced by
 871 this lowering is linear to the number of retained values.
 872
 873 Example:
 874 ```mlir
 875 %0:2 = bufferization.dealloc (%m : memref<2xf32>) if (%cond)
 876                       retain (%r0, %r1 : memref<1xf32>, memref<2xf32>)
 877 return %0#0, %0#1 : i1, i1
 878 ```
 879 is lowered to
 880 ```mlir
 881 %m_base_pointer = memref.extract_aligned_pointer_as_index %m
 882 %r0_base_pointer = memref.extract_aligned_pointer_as_index %r0
 883 %r0_does_not_alias = arith.cmpi ne, %m_base_pointer, %r0_base_pointer
 884 %r1_base_pointer = memref.extract_aligned_pointer_as_index %r1
 885 %r1_does_not_alias = arith.cmpi ne, %m_base_pointer, %r1_base_pointer
 886 %not_retained = arith.andi %r0_does_not_alias, %r1_does_not_alias : i1
 887 %should_dealloc = arith.andi %not_retained, %cond : i1
 888 scf.if %should_dealloc {
 889   memref.dealloc %m : memref<2xf32>
 890 }
 891 %true = arith.constant true
 892 %r0_does_alias = arith.xori %r0_does_not_alias, %true : i1
 893 %r0_ownership = arith.andi %r0_does_alias, %cond : i1
 894 %r1_does_alias = arith.xori %r1_does_not_alias, %true : i1
 895 %r1_ownership = arith.andi %r1_does_alias, %cond : i1
 896 return %r0_ownership, %r1_ownership : i1, i1
 897 ```
 898
 899 ## Memory Layouts
 900
 901 One-Shot Bufferize bufferizes ops from top to bottom. This works well when all
 902 ops are bufferizable. However, when encountering a non-bufferizable tensor with
 903 `allow-unknown-ops`, One-Shot Bufferize must insert `to_memref` ops at the
 904 bufferization boundary and decide on a memref type. By default, One-Shot
 905 Bufferize choose the most dynamic memref type wrt. layout maps. E.g.:
 906
 907 ```mlir
 908 %0 = "my_dialect.unbufferizable_op(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>)
 909 %1 = tensor.extract %0[%idx1, %idx2] : tensor<?xf32>
 910 ```
 911
 912 When bufferizing the above IR, One-Shot Bufferize inserts a `to_memref` ops with
 913 dynamic offset and strides:
 914
 915 ```mlir
 916 %0 = "my_dialect.unbufferizable_op(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>)
 917 %0_m = bufferization.to_memref %0 : memref<?x?xf32, strided<[?, ?], offset: ?>>
 918 %1 = memref.load %0_m[%idx1, %idx2] : memref<?x?xf32, strided<[?, ?], offset: ?>>
 919 ```
 920
 921 All users of `%0` have fully dynamic layout maps. This ensures that the
 922 bufferized IR composes well with future bufferizations of `unbufferizable_op`
 923 (maybe bufferized by another pass), regardless of the exact memref type of the
 924 future bufferization. If the op turns out to be bufferized to an op with a
 925 simpler memref type (e.g., identity layout map), we expect that canonicalization
 926 patterns would clean up unnecessarily dynamic layout maps. (Some of these
 927 canonicalization patterns may not be implemented yet.)
 928
 929 One-Shot Bufferize tries to infer the most precise memref type when bufferizing
 930 an op. If the entire IR is bufferizable, we do not have to resort to
 931 conservatively use fully dynamic layout maps. In that case, we also do not have
 932 to rely on canonicalization patterns to clean up the bufferized IR.
 933
 934 Note: There are some bufferizable ops for which a percise layout map cannot be
 935 inferred. E.g., a `tensor.cast` from a `tensor<*xf32>` to a `tensor<?x?xf32>`
 936 must be bufferized to a `memref.cast` with a memref type that has a fully
 937 dynamic layout map.
 938
 939 One-Shot Bufferize has an option `unknown-type-conversion` to control the
 940 generation of layout maps when no precise layout can be inferred:
 941
 942 *   `fully-dynamic-layout-map` uses fully dynamic layout maps and is the default
 943     behavior. This composes well when IR is partially bufferized.
 944 *   `identity-layout-map` uses static identity layout maps. This option can be
 945     useful for legacy code that cannot handle memref types with layout maps.
 946     Note that this setting can lead to additional buffer copies when folding a
 947     `to_tensor`/`to_memref` pair with memref types that are not cast-compatible.
 948
 949 Note: The `unknown-type-conversion` option does not affect layout maps of
 950 function signatures. There is a separate `function-signature-type-conversion`
 951 option that controls layout maps of function parameters and function results.
 952
 953 ## Extending One-Shot Bufferize
 954
 955 Custom ops can be bufferized if they implement `BufferizableOpInterface`. Users
 956 must at least implement the following interface methods.
 957
 958 *   `bufferizesToMemoryRead`: Return `true` if the buffer of the given tensor
 959     OpOperand is read.
 960 *   `bufferizesToMemoryWrite`: Return `true` if the buffer of the given tensor
 961     OpOperand is written (if bufferizing in-place).
 962 *   `getAliasingOpResult`: Return the OpResults that may share the same buffer
 963     as the given OpOperand. This interface method describes to
 964     OpOperand-to-OpResult mapping wrt. destination-passing style.
 965 *   `bufferRelation`: Return `BufferRelation::Equivalent` if the given OpResult
 966     is the exact same memref as the aliasing OpOperand after bufferization (in
 967     case of in-place bufferization). Otherwise, (e.g., they overlap but are not
 968     necessarily the exact same memrefs), `BufferRelation::Unknown` should be
 969     returned. Additional buffer relations will be added in the future, but
 970     `BufferRelation::Unknown` is always safe.
 971 *   `bufferize`: Rewrite the op with the given rewriter. Ops should be replaced
 972     with `bufferization::replaceOpWithBufferizedValues`.
 973
 974 To get a better intuition of the interface methods, we invite users to take a
 975 look at existing implementations in MLIR, e.g., the implementation of
 976 `tensor.insert` or `tensor.extract`.
 977
 978 ## Debugging Buffer Copies
 979
 980 To get a better understanding of why One-Shot Bufferize introduced a buffer
 981 copy, users can run the pass with `test-analysis-only print-conflicts`. Every
 982 tensor op is then annotated with an attribute that has a boolean value for each
 983 tensor OpOperand. `true` means that the OpOperand bufferizes in-place. `false`
 984 means that the OpOperand bufferizes out-of-place and a buffer copy will be
 985 inserted.
 986
 987 There are two reasons why a buffer copy may be inserted.
 988
 989 1.  Due to a RaW conflict, it is not safe to bufferize in-place. I.e., the
 990     overwritten data is still needed.
 991 2.  The buffer is not writable. E.g., `memref.global` buffers that are the
 992     result of `arith.constant` ops are never modified.
 993
 994 In the first case, `print-conflicts` illustrates the conflict in the form of a
 995 ("read", "conflicting write", "last write") tuple.
 996
 997 ## Understanding the SSA Use-Def Chain Analysis
 998
 999 To get a better understanding of the SSA Use-Def Chain Analysis and the RaW
1000 conflict detection algorithm, we invite interested users to read the
1001 [design document](https://discourse.llvm.org/uploads/short-url/5kckJ3DftYwQokG252teFgw3sYa.pdf)
1002 and watch the corresponding [ODM talk](https://youtu.be/TXEo59CYS9A)
1003 ([slides](https://mlir.llvm.org/OpenMeetings/2022-01-13-One-Shot-Bufferization.pdf)).
1004 can be used to bufferize a program in a single pass, as long as each op
1005
1006 ## Migrating from Dialect Conversion-based Bufferization
1007
1008 Both dialect conversion-based bufferization and One-Shot Bufferize generate
1009 `to_tensor`/`to_memref` ops at the bufferization boundary (when run with
1010 `allow-unknown-ops`). They can be combined and run in sequence. However,
1011 One-Shot Bufferize must run first because it cannot analyze those boundary ops.
1012 To update existing code step-by-step, it may be useful to specify a dialect
1013 filter for One-Shot Bufferize, so that dialects can be switched over one-by-one.
1014
1015 ## Bufferization Function Graphs
1016
1017 One-Shot Bufferize does currently not support function graph bufferization.
1018 I.e., `CallOp`, `ReturnOp` and function bbArgs are not bufferizable. Users can
1019 run the existing `--func-bufferize` bufferization pass after One-Shot Bufferize.
1020
1021 Alternatively, users can try
1022 [`ModuleBufferization`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Linalg/ComprehensiveBufferize/ModuleBufferization.h#L31),
1023 which is an extension of One-Shot Bufferize. This bufferization is still under
1024 development and does not support arbitrary IR. In essence, returning a tensor
1025 from a function is not supported, unless it is equivalent to a function bbArg.
1026 In that case, the corresponding return value can simply be dropped during
1027 bufferization.
1028
1029 ## Dialect Conversion-based Bufferization
1030
1031 Disclaimer: Most dialect conversion-based bufferization has been migrated to
1032 One-Shot Bufferize. New users should use One-Shot Bufferize (with or without
1033 analysis). The following documentation is only for existing users of dialect
1034 conversion-based bufferization.
1035
1036 This system is a simple application of MLIR's dialect conversion infrastructure.
1037 The bulk of the code related to bufferization is a set of ordinary
1038 `ConversionPattern`'s that dialect authors write for converting ops that operate
1039 on `tensor`'s to ops that operate on `memref`'s. A set of conventions and best
1040 practices are followed that allow these patterns to be run across multiple
1041 independent passes (rather than requiring a single huge atomic conversion pass),
1042 which makes the compilation pipelines scalable, robust, and easy to debug.
1043
1044 This document is targeted at people looking to utilize MLIR's bufferization
1045 functionality, along with people who want to extend it to cover their own ops.
1046
1047 <a name="the-talk">**NOTE:**</a> Before reading this document, please watch the
1048 talk "Type Conversions the Not-So-Hard-Way: MLIR's New Bufferization
1049 Infrastructure"
1050 ([slides](https://drive.google.com/file/d/1FVbzCXxZzS9LBLuvpPNLWJD-XDkt54ky/view?usp=sharing),
1051 [recording](https://drive.google.com/file/d/1VfVajitgf8ZPnd-HRkJvaJiFLhBsluXN/view?usp=sharing)).
1052 That talk gives a high-level overview of the bufferization infrastructure and
1053 important conceptual details related to using the MLIR dialect conversion
1054 infrastructure.
1055
1056 ### Bufferization's place in a compilation pipeline
1057
1058 Bufferization itself does not free any of the buffers that have been allocated,
1059 nor does it do anything particularly intelligent with the placement of buffers
1060 w.r.t. control flow. Thus, a realistic compilation pipeline will usually consist
1061 of:
1062
1063 1.  Bufferization
1064 1.  Buffer optimizations such as `buffer-hoisting`, `buffer-loop-hoisting`, and
1065     `promote-buffers-to-stack`, which do optimizations that are only exposed
1066     after bufferization.
1067 1.  Finally, running the [buffer deallocation](BufferDeallocationInternals.md)
1068     pass.
1069
1070 After buffer deallocation has been completed, the program will be quite
1071 difficult to transform due to the presence of the deallocation ops. Thus, other
1072 optimizations such as linalg fusion on memrefs should be done before that stage.
1073
1074 ### General structure of the bufferization process
1075
1076 Bufferization consists of running multiple *partial* bufferization passes,
1077 followed by one *finalizing* bufferization pass.
1078
1079 There is typically one partial bufferization pass per dialect (though other
1080 subdivisions are possible). For example, for a dialect `X` there will typically
1081 be a pass `X-bufferize` that knows how to bufferize all the ops in that dialect.
1082 By running pass `X-bufferize` for each dialect `X` in the program, all the ops
1083 in the program are incrementally bufferized.
1084
1085 Partial bufferization passes create programs where only some ops have been
1086 bufferized. These passes will create *materializations* (also sometimes called
1087 "casts") that convert between the `tensor` and `memref` type, which allows
1088 bridging between ops that have been bufferized and ops that have not yet been
1089 bufferized.
1090
1091 Finalizing bufferizations complete the bufferization process, and guarantee that
1092 there are no tensors remaining in the program. This involves eliminating the
1093 materializations. The pass `finalizing-bufferize` provides a minimal pass that
1094 only eliminates materializations and issues an error if any unbufferized ops
1095 exist in the program.
1096
1097 However, it is possible for a finalizing bufferization to do more than just
1098 eliminate materializations. By adding patterns (just as a partial bufferization
1099 would), it is possible for a finalizing bufferization pass to simultaneously
1100 bufferize ops and eliminate materializations. This has a number of disadvantages
1101 discussed in the talk and should generally be avoided.
1102
1103 ### Example
1104
1105 As a concrete example, we will look at the bufferization pipeline from the
1106 `mlir-npcomp` reference backend
1107 ([code](https://github.com/llvm/mlir-npcomp/blob/97d6d04d41216e73d40b89ffd79620973fc14ce3/lib/RefBackend/RefBackend.cpp#L232)).
1108 The code, slightly simplified and annotated, is reproduced here:
1109
1110 ```c++
1111   // Partial bufferization passes.
1112   pm.addPass(createTensorConstantBufferizePass());
1113   pm.addNestedPass<func::FuncOp>(createTCPBufferizePass()); // Bufferizes the downstream `tcp` dialect.
1114   pm.addNestedPass<func::FuncOp>(createSCFBufferizePass());
1115   pm.addNestedPass<func::FuncOp>(createLinalgBufferizePass());
1116   pm.addNestedPass<func::FuncOp>(createTensorBufferizePass());
1117   pm.addPass(createFuncBufferizePass());
1118
1119   // Finalizing bufferization pass.
1120   pm.addNestedPass<func::FuncOp>(createFinalizingBufferizePass());
1121 ```
1122
1123 Looking first at the partial bufferization passes, we see that there are a
1124 sequence of `FuncOp` passes (which run in parallel on functions). These function
1125 passes are bracketed by `arith-bufferize` and `func-bufferize`, which are module
1126 passes (and thus serialize the parallel compilation process). These two passes
1127 must be module passes because they make changes to the top-level module.
1128
1129 The bulk of the bufferization work is done by the function passes. Most of these
1130 passes are provided as part of the upstream MLIR distribution and bufferize
1131 their respective dialects (e.g. `scf-bufferize` bufferizes the `scf` dialect).
1132 The `tcp-bufferize` pass is an exception -- it is a partial bufferization pass
1133 used to bufferize the downstream `tcp` dialect, and fits in perfectly with all
1134 the other passes provided upstream.
1135
1136 The last pass is the finalizing bufferization pass. The `mlir-npcomp` reference
1137 backend has arranged that all ops are bufferized by partial bufferizations, so
1138 that the upstream `finalizing-bufferize` pass can be used as the finalizing
1139 bufferization pass. This gives excellent diagnostics when something goes wrong
1140 with the bufferization process, such as due to an op that wasn't handled by any
1141 pattern.
1142
1143 ### How to write a partial bufferization pass
1144
1145 The contract of a partial bufferization pass is that a subset of ops (or kinds
1146 of ops, customizable by a ConversionTarget) get bufferized.
1147
1148 A partial bufferization pass is just a pass that uses the
1149 [dialect conversion](DialectConversion.md) framework to apply
1150 `ConversionPattern`s with a `tensor` to `memref` type conversion.
1151
1152 To describe how to write such a pass, we will walk through an example, the
1153 `tensor-bufferize` pass
1154 ([code](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/Tensor/Transforms/Bufferize.cpp#L23),
1155 [test](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/test/Dialect/Tensor/bufferize.mlir#L1))
1156 that bufferizes the `tensor` dialect. Note that these passes have been replaced
1157 with a `BufferizableOpInterface`-based implementation in the meantime, so we
1158 have to take a looker at an older version of the code.
1159
1160 The bulk of the code in the pass will be a set of conversion patterns, with a
1161 simple example being
1162 [BufferizeCastOp](https://github.com/llvm/llvm-project/blob/2bf6e443e54604c7818c4d1a1837f3d091023270/mlir/lib/Dialect/Tensor/Transforms/Bufferize.cpp#L23)).
1163
1164 ```
1165 class BufferizeCastOp : public OpConversionPattern<tensor::CastOp> {
1166 public:
1167   using OpConversionPattern::OpConversionPattern;
1168   LogicalResult
1169   matchAndRewrite(tensor::CastOp op, OpAdaptor adaptor,
1170                   ConversionPatternRewriter &rewriter) const override {
1171     auto resultType = getTypeConverter()->convertType(op.getType());
1172     rewriter.replaceOpWithNewOp<MemRefCastOp>(op, resultType, adaptor.source());
1173     return success();
1174   }
1175 };
1176 ```
1177
1178 See [the talk](#the-talk) for more details on how to write these patterns.
1179
1180 The
1181 [pass itself](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/Tensor/Transforms/Bufferize.cpp#L57)
1182 is very small, and follows the basic pattern of any dialect conversion pass.
1183
1184 ```
1185 void mlir::populateTensorBufferizePatterns(
1186     BufferizeTypeConverter &typeConverter, RewritePatternSet &patterns) {
1187   patterns.add<BufferizeCastOp, BufferizeExtractOp>(typeConverter,
1188                                                     patterns.getContext());
1189 }
1190
1191 struct TensorBufferizePass : public TensorBufferizeBase<TensorBufferizePass> {
1192   void runOnOperation() override {
1193     auto *context = &getContext();
1194     BufferizeTypeConverter typeConverter;
1195     RewritePatternSet patterns(context);
1196     ConversionTarget target(*context);
1197
1198     populateTensorBufferizePatterns(typeConverter, patterns);
1199     target.addIllegalOp<tensor::CastOp, tensor::ExtractOp>();
1200     target.addLegalDialect<func::FuncDialect>();
1201
1202     if (failed(
1203             applyPartialConversion(getOperation(), target, std::move(patterns))))
1204       signalPassFailure();
1205   }
1206 };
1207 ```
1208
1209 The pass has all the hallmarks of a dialect conversion pass that does type
1210 conversions: a `TypeConverter`, a `RewritePatternSet`, and a `ConversionTarget`,
1211 and a call to `applyPartialConversion`. Note that a function
1212 `populateTensorBufferizePatterns` is separated, so that power users can use the
1213 patterns independently, if necessary (such as to combine multiple sets of
1214 conversion patterns into a single conversion call, for performance).
1215
1216 One convenient utility provided by the MLIR bufferization infrastructure is the
1217 `BufferizeTypeConverter`, which comes pre-loaded with the necessary conversions
1218 and materializations between `tensor` and `memref`.
1219
1220 In this case, the `BufferizationOpsDialect` is marked as legal, so the
1221 `bufferization.to_tensor` and `bufferization.to_memref` ops, which are inserted
1222 automatically by the dialect conversion framework as materializations, are
1223 legal. There is a helper `populateBufferizeMaterializationLegality`
1224 ([code](https://github.com/llvm/llvm-project/blob/a0b65a7bcd6065688189b3d678c42ed6af9603db/mlir/include/mlir/Transforms/Bufferize.h#L53))
1225 which helps with this in general.
1226
1227 ### Other partial bufferization examples
1228
1229 -   `scf-bufferize`
1230     ([code](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/lib/Dialect/SCF/Transforms/Bufferize.cpp#L1),
1231     [test](https://github.com/llvm/llvm-project/blob/bc8acf2ce8ad6e8c9b1d97b2e02d3f4ad26e1d9d/mlir/test/Dialect/SCF/bufferize.mlir#L1))
1232
1233     -   Bufferizes ops from the `scf` dialect.
1234     -   This is an example of how to bufferize ops that implement
1235         `RegionBranchOpInterface` (that is, they use regions to represent
1236         control flow).
1237     -   The bulk of the work is done by
1238         `lib/Dialect/SCF/Transforms/StructuralTypeConversions.cpp`
1239         ([code](https://github.com/llvm/llvm-project/blob/daaaed6bb89044ac58a23f1bb1ccdd12342a5a58/mlir/lib/Dialect/SCF/Transforms/StructuralTypeConversions.cpp#L1)),
1240         which is well-commented and covers how to correctly convert ops that
1241         contain regions.
1242
1243 -   `func-bufferize`
1244     ([code](https://github.com/llvm/llvm-project/blob/2f5715dc78328215d51d5664c72c632a6dac1046/mlir/lib/Dialect/Func/Transforms/FuncBufferize.cpp#L1),
1245     [test](https://github.com/llvm/llvm-project/blob/2f5715dc78328215d51d5664c72c632a6dac1046/mlir/test/Dialect/Func/func-bufferize.mlir#L1))
1246
1247     -   Bufferizes `func`, `call`, and `BranchOpInterface` ops.
1248     -   This is an example of how to bufferize ops that have multi-block
1249         regions.
1250     -   This is an example of a pass that is not split along dialect
1251         subdivisions.
1252
1253 ### How to write a finalizing bufferization pass
1254
1255 The contract of a finalizing bufferization pass is that all tensors are gone
1256 from the program.
1257
1258 The easiest way to write a finalizing bufferize pass is to not write one at all!
1259 MLIR provides a pass `finalizing-bufferize` which eliminates the
1260 `bufferization.to_tensor` / `bufferization.to_memref` materialization ops
1261 inserted by partial bufferization passes and emits an error if that is not
1262 sufficient to remove all tensors from the program.
1263
1264 This pass is sufficient when partial bufferization passes have bufferized all
1265 the ops in the program, leaving behind only the materializations. When possible,
1266 it is recommended to structure your pass pipeline this way, as this has the
1267 significant advantage that if an op does not get bufferized (due to a missing
1268 pattern, bug in the code, etc.), `finalizing-bufferize` will emit a nice clean
1269 error, and the IR seen by `finalizing-bufferize` will only contain only one
1270 unbufferized op.
1271
1272 However, before the current bufferization infrastructure was put in place,
1273 bufferization could only be done as a single finalizing bufferization mega-pass
1274 that used the `populate*BufferizePatterns` functions from multiple dialects to
1275 simultaneously bufferize everything at once. Thus, one might see code in
1276 downstream projects structured this way. This structure is not recommended in
1277 new code. A helper, `populateEliminateBufferizeMaterializationsPatterns`
1278 ([code](https://github.com/llvm/llvm-project/blob/a0b65a7bcd6065688189b3d678c42ed6af9603db/mlir/include/mlir/Transforms/Bufferize.h#L58))
1279 is available for such passes to provide patterns that eliminate
1280 `bufferization.to_tensor` and `bufferization.to_memref`.
1281
1282 ### Changes since [the talk](#the-talk)
1283
1284 -   `func-bufferize` was changed to be a partial conversion pass, and there is a
1285     new `finalizing-bufferize` which serves as a general finalizing
1286     bufferization pass.
1287 -   Most partial bufferization passes have been reimplemented in terms of
1288     `BufferizableOpInterface`. New users should use One-Shot Bufferize instead
1289     of dialect conversion-based bufferization.