mlir/docs/Bufferization.md

   1 # Bufferization
   2
   3 [TOC]
   4
   5 ## Overview
   6
   7 Bufferization in MLIR is the process of converting ops with `tensor` semantics
   8 to ops with `memref` semantics. There are multiple MLIR passes that are related
   9 to bufferization. These passes typically run as one of the last steps in a
  10 pass pipeline, right before lowering to `memref` ops to LLVM. That is because
  11 many transformations are easier or only supported in tensor land; e.g.,
  12 [tile/fuse/… on tensors first](https://llvm.discourse.group/t/rfc-linalg-on-tensors-update-and-comprehensive-bufferization-rfc/3373),
  13 then bufferize the remaining IR.
  14
  15 ![bufferization passes](/includes/img/bufferization_passes.svg)
  16
  17 The most important bufferization pass is *One-Shot Bufferize*: This pass
  18 rewrites `tensor` IR to `memref` IR. There are additional helper passes that
  19 preprocess IR (e.g., so that IR can be bufferized more efficiently), perform
  20 buffer-level optimizations such as allocation hoisting, and
  21 [insert buffer deallocation ops](OwnershipBasedBufferDeallocation.md) so that
  22 the resulting `memref` IR has no memory leaks.
  23
  24 ## Deprecated Passes
  25
  26 The buffer deallocation pass has been deprecated in favor of the ownership-based
  27 buffer deallocation pipeline. The deprecated pass has some limitations that may
  28 cause memory leaks in the resulting IR.
  29
  30 ## What is One-Shot Bufferize?
  31
  32 One-Shot Bufferize is a tensor bufferization pass designed for IR in
  33 [destination-passing style](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/dps-fhpc17.pdf),
  34 and with aggressive in-place bufferization.
  35
  36 One-Shot Bufferize is:
  37
  38 *   **Monolithic**: A single MLIR pass does the entire work.
  39
  40 *   **Extensible** via an op interface: All ops that implement
  41     `BufferizableOpInterface` can be bufferized.
  42
  43 *   A **whole-function at a time analysis**. In-place bufferization decisions
  44     are made by analyzing SSA use-def chains on tensors. Op interface
  45     implementations not only provide the rewrite logic from tensor ops to memref
  46     ops, but also helper methods for One-Shot Bufferize's analysis to query
  47     information about an op's bufferization/memory semantics.
  48
  49 *   **2-Phase**: Bufferization is internally broken down into 2 steps: First,
  50     analyze the entire IR and make bufferization decisions. Then, bufferize
  51     (rewrite) the IR. The analysis has access to exact SSA use-def information.
  52     It incrementally builds alias and equivalence sets and does not rely on a
  53     posteriori-alias analysis from preallocated memory.
  54
  55 *   **Greedy**: Operations are analyzed one-by-one and it is decided on the spot
  56     whether a tensor OpOperand must be copied or not. Heuristics determine the
  57     order of analysis.
  58
  59 *   **Modular**: The current One-Shot Analysis can be replaced with a different
  60     analysis. The result of the analysis are queried by the bufferization via
  61     `AnalysisState`, in particular `AnalysisState::isInPlace`. Any derived class
  62     of `AnalysisState` that implements a small number virtual functions can
  63     serve as a custom analysis. It is even possible to run One-Shot Bufferize
  64     without any analysis (`AlwaysCopyAnalysisState`), in which case One-Shot
  65     Bufferize copies every buffer before writing to it.
  66
  67 Note that One-Shot Bufferize does not deallocate buffers. That is done by the
  68 [Ownership-based Buffer Deallocation passes](OwnershipBasedBufferDeallocation.md).
  69
  70 ## Goals of Bufferization
  71
  72 The high-level goal of every bufferization technique is to:
  73
  74 1. Use as little memory as possible.
  75 2. Copy as little memory as possible.
  76
  77 This implies reusing already allocated buffers when possible, turning
  78 bufferization into an algorithmically complex problem with similarities to
  79 register allocation.
  80
  81 Depending on the concrete use case, there may be additional bufferization
  82 requirements. If the contents of a buffer are expensive to compute, there could
  83 be a tradeoff between *recomputation* and *compute once and copy*. On the
  84 contrary, it may not even be possible to allocate new buffers at runtime on some
  85 architectures.
  86
  87 ## Destination-Passing Style
  88
  89 Bufferization is an algorithmically complex problem. Given an op with a tensor
  90 result, bufferization has to choose a memref buffer in which the result can be
  91 stored. It is always safe to allocate a brand new buffer, but such a
  92 bufferization strategy would be unacceptable for high-performance codegen. When
  93 choosing an already existing buffer, we must be careful not to accidentally
  94 overwrite data that is still needed later in the program.
  95
  96 To simplify this problem, One-Shot Bufferize was designed to take advantage of
  97 *destination-passing style* (DPS). In MLIR, DPS op should implement the
  98 [`DestinationStyleOpInterface`](https://github.com/llvm/llvm-project/blob/792d437b56adfb3416daf8105942d4899fb82763/mlir/include/mlir/Interfaces/DestinationStyleOpInterface.td).
  99 DPS exists in itself independently of bufferization and is tied to SSA
 100 semantics: many ops are "updating" a part of their input SSA variables. For
 101 example the LLVM instruction
 102 [`insertelement`](https://llvm.org/docs/LangRef.html#insertelement-instruction)
 103 is inserting an element inside a vector. Since SSA values are immutable, the
 104 operation returns a copy of the input vector with the element inserted.
 105 Another example in MLIR is `linalg.generic` on tensors, which always has an
 106 extra `outs` operand for each result, which provides the initial values to
 107 update (for example when the operation is doing a reduction).
 108
 109 `outs` operands are referred to as "destinations" in the following (quotes are
 110 important as this operand isn't modified in place but copied) and comes into
 111 place in the context of bufferization as a possible "anchor" for the
 112 bufferization algorithm. This allows the user to shape the input in a form that
 113 guarantees close to optimal bufferization result when carefully choosing the
 114 SSA value used as "destination".
 115
 116 For every tensor result, a DPS op has a corresponding tensor operand. If there
 117 aren't any other conflicting uses of this tensor, the bufferization can alias
 118 it with the op result and perform the operation "in-place" by reusing the buffer
 119 allocated for this "destination" input.
 120
 121 As an example, consider the following op: `%r = tensor.insert %f into
 122 %t[%idx] : tensor<5xf32>`
 123
 124 ![tensor.insert example](/includes/img/bufferization_tensor_insert_dst.svg)
 125
 126 `%t` is the "destination" in this example. When choosing a buffer for the result
 127 `%r`, denoted as `buffer(%r)`, One-Shot Bufferize considers only two options:
 128
 129 1.  `buffer(%r) = buffer(%t)`: store the result in the existing `buffer(%t)`.
 130     Note that this is not always possible. E.g., if the old contents of
 131     `buffer(%t)` are still needed. One-Shot Bufferize's main task is to detect
 132     such cases and fall back to the second option when necessary.
 133 2.  `buffer(%r)` is a newly allocated buffer.
 134
 135 There may be other buffers in the same function that could potentially be used
 136 for `buffer(%r)`, but those are not considered by One-Shot Bufferize to keep the
 137 bufferization simple. One-Shot Bufferize could be extended to consider such
 138 buffers in the future to achieve a better quality of bufferization.
 139
 140 Tensor ops that are not in destination-passing style always bufferized to a
 141 memory allocation. E.g.:
 142
 143 ```mlir
 144 %0 = tensor.generate %sz {
 145 ^bb0(%i : index):
 146   %cst = arith.constant 0.0 : f32
 147   tensor.yield %cst : f32
 148 } : tensor<?xf32>
 149 ```
 150
 151 The result of `tensor.generate` does not have a "destination" operand, so
 152 bufferization allocates a new buffer. This could be avoided by instead using an
 153 op such as `linalg.generic`, which can express the same computation with a
 154 "destination" operand, as specified behind outputs (`outs`):
 155
 156 ```mlir
 157 #map = affine_map<(i) -> (i)>
 158 %0 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel"]}
 159                     outs(%t : tensor<?xf32>) {
 160   ^bb0(%arg0 : f32):
 161     %cst = arith.constant 0.0 : f32
 162     linalg.yield %cst : f32
 163 } -> tensor<?xf32>
 164 ```
 165
 166 At first glance, the above `linalg.generic` op may not seem very useful because
 167 the output tensor `%t` is entirely overwritten. Why pass the tensor `%t` as an
 168 operand in the first place? As an example, this can be useful for overwriting a
 169 slice of a tensor:
 170
 171 ```mlir
 172 %t = tensor.extract_slice %s [%idx] [%sz] [1] : tensor<?xf32> to tensor<?xf32>
 173 %0 = linalg.generic ... outs(%t) { ... } -> tensor<?xf32>
 174 %1 = tensor.insert_slice %0 into %s [%idx] [%sz] [1]
 175     : tensor<?xf32> into tensor<?xf32>
 176 ```
 177
 178 The above example bufferizes to a `memref.subview`, followed by a
 179 "`linalg.generic` on memrefs" that overwrites the memory of the subview, assuming
 180 that the slice `%t` has no other user. The `tensor.insert_slice` then bufferizes
 181 to a no-op (in the absence of RaW conflicts such as a subsequent read of `%s`).
 182
 183 RaW conflicts are detected with an analysis of SSA use-def chains (details
 184 later). One-Shot Bufferize works best if there is a single SSA use-def chain,
 185 where the result of a tensor op is the operand of the next tensor ops, e.g.:
 186
 187 ```mlir
 188 %0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>)
 189 %1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
 190 %2 = "my_dialect.yet_another_op"(%1) : (tensor<?xf32>) -> (tensor<?xf32>)
 191 ```
 192
 193 Buffer copies are likely inserted if the SSA use-def chain splits at some point,
 194 e.g.:
 195
 196 ```mlir
 197 %0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>)
 198 %1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
 199
 200 // "yet_another_op" likely needs to read the data of %0, so "another_op" cannot
 201 // in-place write to buffer(%0).
 202 %2 = "my_dialect.yet_another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
 203 ```
 204
 205 ## Tensor / MemRef Boundary
 206
 207 The bufferization dialect provides a few helper ops to connect tensor IR (that
 208 should be bufferized) with existing buffers (that may be allocated/provided by
 209 a different runtime/library/etc.).
 210
 211 `bufferization.to_memref %t` returns the future buffer of a tensor SSA value.
 212 `bufferization.to_tensor %m` returns a tensor SSA value for a given MemRef
 213 buffer. `bufferization.materialize_in_destination` indicates that a tensor value
 214 should materialize in a certain buffer.
 215
 216 Consider the following example, where a TOSA matmul result should materialize in
 217 an existing buffer `%C`:
 218
 219 ```mlir
 220 // Batched TOSA matrix multiplication. %A and %B are the
 221 // inputs, %C is the output.
 222 func.func @test_matmul(%A: memref<1x17x19xf32>,
 223                        %B: memref<1x19x29xf32>,
 224                        %C: memref<1x17x29xf32>) {
 225
 226   %A_tensor = bufferization.to_tensor %A restrict : memref<1x17x19xf32>
 227   %B_tensor = bufferization.to_tensor %B restrict : memref<1x19x29xf32>
 228
 229   %0 = tosa.matmul %A_tensor, %B_tensor
 230       : (tensor<1x17x19xf32>, tensor<1x19x29xf32>) ->
 231          tensor<1x17x29xf32>
 232
 233   bufferization.materialize_in_destination
 234     %0 in restrict writable %C
 235       : (tensor<1x17x29xf32>, memref<1x17x29xf32>) -> ()
 236
 237   return
 238 }
 239 ```
 240
 241 Note that all bufferization ops in this example have the `restrict` unit
 242 attribute set. This attribute is similar to the C restrict keyword and indicates
 243 that there is no other `to_tensor` or `materialize_in_destination` op with
 244 the same or an aliasing MemRef operand. Only such
 245 `to_tensor`/`materialize_in_destination` ops are supported. The `restrict`
 246 attribute gives strong aliasing guarantees to the bufferization analysis and
 247 allows us to look only at the tensor IR in a program. (Ops that do not operate
 248 on tensors are ignored by the One-Shot Bufferize.)
 249
 250 Also note that `tosa.matmul` cannot be bufferized as is: there is no
 251 `BufferizableOpInterface` implementation for that op. However, the op can be
 252 lowered to a combination of `tensor.empty` and `linalg.matmul`, which can be
 253 bufferized.
 254
 255 ## Using One-Shot Bufferize
 256
 257 MLIR provides a pass
 258 [`-one-shot-bufferize`](https://mlir.llvm.org/docs/Passes/#-one-shot-bufferize-one-shot-bufferize)
 259 that performs an analysis and bufferizes all ops with tensor semantics that
 260 implement `BufferizableOpInterface`. For modularity reasons, these op interface
 261 implementations are typically external models that live in a dialect's
 262 "Transforms" build unit. (External models are a mechanism for implementing an op
 263 interface in a different build unit.) It is the user's responsibility to ensure
 264 that all needed external models are registered before running One-Shot
 265 Bufferize.
 266
 267 By default, One-Shot Bufferize fails when it encounters an op with tensor
 268 semantics (i.e., tensor result or tensor operand) that is not bufferizable
 269 (i.e., does not implement `BufferizableOpInterface`). This can be avoided with
 270 `allow-unknown-ops`. In that case, One-Shot Bufferize inserts
 271 `to_memref`/`to_tensor` ops around the bufferization boundary.
 272
 273 One-Shot Bufferize can be configured to bufferize only ops from a set of
 274 dialects with `dialect-filter`.
 275
 276 One-Shot Bufferize can also be called programmatically with
 277 [`bufferization::runOneShotBufferize`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h#L167).
 278 Alternatively,
 279 [`bufferization::bufferizeOp`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/Bufferize.h#L78)
 280 skips the analysis and inserts a copy on every buffer write.
 281
 282 By default, function boundaries are not bufferized. This is because there are
 283 currently limitations around function graph bufferization: recursive
 284 calls are not supported. As long as there are no recursive calls, function
 285 boundary bufferization can be enabled with `bufferize-function-boundaries`. Each
 286 tensor function argument and tensor function result is then turned into a
 287 memref. The layout map of the memref type can be controlled with
 288 `function-boundary-type-conversion`.
 289
 290 ## Memory Layouts
 291
 292 One-Shot Bufferize bufferizes ops from top to bottom. This works well when all
 293 ops are bufferizable. However, when encountering a non-bufferizable tensor with
 294 `allow-unknown-ops`, One-Shot Bufferize must insert `to_memref` ops at the
 295 bufferization boundary and decide on a memref type. By default, One-Shot
 296 Bufferize choose the most dynamic memref type wrt. layout maps. E.g.:
 297
 298 ```mlir
 299 %0 = "my_dialect.unbufferizable_op(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>)
 300 %1 = tensor.extract %0[%idx1, %idx2] : tensor<?xf32>
 301 ```
 302
 303 When bufferizing the above IR, One-Shot Bufferize inserts a `to_memref` ops with
 304 dynamic offset and strides:
 305
 306 ```mlir
 307 %0 = "my_dialect.unbufferizable_op(%t) : (tensor<?x?xf32>) -> (tensor<?x?xf32>)
 308 %0_m = bufferization.to_memref %0 : memref<?x?xf32, strided<[?, ?], offset: ?>>
 309 %1 = memref.load %0_m[%idx1, %idx2] : memref<?x?xf32, strided<[?, ?], offset: ?>>
 310 ```
 311
 312 All users of `%0` have fully dynamic layout maps. This ensures that the
 313 bufferized IR composes well with future bufferizations of `unbufferizable_op`
 314 (maybe bufferized by another pass), regardless of the exact memref type of the
 315 future bufferization. If the op turns out to be bufferized to an op with a
 316 simpler memref type (e.g., identity layout map), we expect that canonicalization
 317 patterns would clean up unnecessarily dynamic layout maps. (Some of these
 318 canonicalization patterns may not be implemented yet.)
 319
 320 One-Shot Bufferize tries to infer the most precise memref type when bufferizing
 321 an op. If the entire IR is bufferizable, we do not have to resort to
 322 conservatively use fully dynamic layout maps. In that case, we also do not have
 323 to rely on canonicalization patterns to clean up the bufferized IR.
 324
 325 Note: There are some bufferizable ops for which a percise layout map cannot be
 326 inferred. E.g., a `tensor.cast` from a `tensor<*xf32>` to a `tensor<?x?xf32>`
 327 must be bufferized to a `memref.cast` with a memref type that has a fully
 328 dynamic layout map.
 329
 330 One-Shot Bufferize has an option `unknown-type-conversion` to control the
 331 generation of layout maps when no precise layout can be inferred:
 332
 333 *   `fully-dynamic-layout-map` uses fully dynamic layout maps and is the default
 334     behavior. This composes well when IR is partially bufferized.
 335 *   `identity-layout-map` uses static identity layout maps. This option can be
 336     useful for legacy code that cannot handle memref types with layout maps.
 337     Note that this setting can lead to additional buffer copies when folding a
 338     `to_tensor`/`to_memref` pair with memref types that are not cast-compatible.
 339
 340 Note: The `unknown-type-conversion` option does not affect layout maps of
 341 function signatures. There is a separate `function-signature-type-conversion`
 342 option that controls layout maps of function parameters and function results.
 343
 344 ## Extending One-Shot Bufferize
 345
 346 Custom ops can be bufferized if they implement `BufferizableOpInterface`. Users
 347 must at least implement the following interface methods.
 348
 349 *   `bufferizesToMemoryRead`: Return `true` if the buffer of the given tensor
 350     OpOperand is read.
 351 *   `bufferizesToMemoryWrite`: Return `true` if the buffer of the given tensor
 352     OpOperand is written (if bufferizing in-place).
 353 *   `getAliasingOpResult`: Return the OpResults that may share the same buffer
 354     as the given OpOperand. This interface method describes to
 355     OpOperand-to-OpResult mapping wrt. destination-passing style.
 356 *   `bufferRelation`: Return `BufferRelation::Equivalent` if the given OpResult
 357     is the exact same memref as the aliasing OpOperand after bufferization (in
 358     case of in-place bufferization). Otherwise, (e.g., they overlap but are not
 359     necessarily the exact same memrefs), `BufferRelation::Unknown` should be
 360     returned. Additional buffer relations will be added in the future, but
 361     `BufferRelation::Unknown` is always safe.
 362 *   `bufferize`: Rewrite the op with the given rewriter. Ops should be replaced
 363     with `bufferization::replaceOpWithBufferizedValues`.
 364
 365 To get a better intuition of the interface methods, we invite users to take a
 366 look at existing implementations in MLIR, e.g., the implementation of
 367 `tensor.insert` or `tensor.extract`.
 368
 369 Interface implementations of DPS ops (that implement
 370 `DestinationStyleOpInterface`) can derive from
 371 `DstBufferizableOpInterfaceExternalModel`, which provides all necessary
 372 method implementations except for `bufferize`.
 373
 374 ## Debugging Buffer Copies
 375
 376 To get a better understanding of why One-Shot Bufferize introduced a buffer
 377 copy, users can run the pass with `test-analysis-only print-conflicts`. Every
 378 tensor op is then annotated with an attribute that has a boolean value for each
 379 tensor OpOperand. `true` means that the OpOperand bufferizes in-place. `false`
 380 means that the OpOperand bufferizes out-of-place and a buffer copy will be
 381 inserted.
 382
 383 There are two reasons why a buffer copy may be inserted.
 384
 385 1.  Due to a RaW conflict, it is not safe to bufferize in-place. I.e., the
 386     overwritten data is still needed.
 387 2.  The buffer is not writable. E.g., `memref.global` buffers that are the
 388     result of `arith.constant` ops are never modified.
 389
 390 In the first case, `print-conflicts` illustrates the conflict in the form of a
 391 ("read", "conflicting write", "last write") tuple.
 392
 393 A RaW conflict consists of three parts, in the following order according to
 394 op dominance:
 395
 396 1. **Definition:** A tensor `%t` is defined.
 397 2. **Conflicting Write:** An operation writes to `buffer(%t)`.
 398 3. **Read:** An operation reads `%t`.
 399
 400 When such a RaW conflict is detected during the analysis phase, One-Shot
 401 Bufferize will insert a buffer copy for the conflicting write.
 402
 403 **Example**
 404
 405 ```mlir
 406 // RUN: mlir-opt %s -one-shot-bufferize="bufferize-function-boundaries test-analysis-only print-conflicts"
 407 func.func @test(%arg0: f32, %arg1: f32, %arg2: index, %arg3: index) -> (f32, tensor<3xf32>) {
 408   // Create a new tensor with [%arg0, %arg0, %arg0].
 409   %0 = tensor.from_elements %arg0, %arg0, %arg0 : tensor<3xf32>
 410
 411   // Insert something into the new tensor.
 412   %1 = tensor.insert %arg1 into %0[%arg2] : tensor<3xf32>
 413
 414   // Read from the old tensor.
 415   %r = tensor.extract %0[%arg3] : tensor<3xf32>
 416
 417   // Return the extracted value and the result of the insertion.
 418   func.return %r, %1 : f32, tensor<3xf32>
 419 }
 420 ```
 421
 422 The output IR is as follows:
 423
 424 ```mlir
 425 func.func @test(%arg0: f32, %arg1: f32, %arg2: index, %arg3: index) -> (f32, tensor<3xf32>) {
 426   %from_elements = tensor.from_elements %arg0, %arg0, %arg0 {"C_0[DEF: result 0]"} : tensor<3xf32>
 427   %inserted = tensor.insert %arg1 into %from_elements[%arg2] {"C_0[CONFL-WRITE: 1]", __inplace_operands_attr__ = ["none", "false", "none"]} : tensor<3xf32>
 428   %extracted = tensor.extract %from_elements[%arg3] {"C_0[READ: 0]", __inplace_operands_attr__ = ["true", "none"]} : tensor<3xf32>
 429   return {__inplace_operands_attr__ = ["none", "true"]} %extracted, %inserted : f32, tensor<3xf32>
 430 }
 431 ```
 432
 433 Note that the IR was not bufferized. It was merely annotated with the results
 434 of the bufferization analysis. Every operation with tensor semantics has a
 435 `__inplace_operands_attr__` attribute with one value per operand. If an operand
 436 is not a tensor, the respective value is `none`. Otherwise, if the operand was
 437 decided to be bufferized in-place, the value is `true`. A value of `false`
 438 indicates a buffer copy. In the above example, a buffer copy would be inserted
 439 for `tensor.insert`, so that it does not overwrite `buffer(%from_elements)`,
 440 which is still needed for `tensor.extract`.
 441
 442 For each RaW (there is only one in the example), three `C_i` attributes were
 443 added:
 444
 445 * `C_0[DEF: result 0]`: A tensor is defined: 0-th result of
 446   `tensor.from_elements`.
 447 * `C_0[CONFL-WRITE: 1]`: An operation (if bufferized in-place) would write into
 448   the future buffer of the defined tensor: 1-st operand of `tensor.insert`.
 449 * `C_0[READ: 0]`: An operation reads the tensor definition: 0-th operand of
 450   `tensor.extract`.
 451
 452 The fully bufferized IR (with the inserted buffer copy) is as follows:
 453
 454 ```mlir
 455 func.func @test(%arg0: f32, %arg1: f32, %arg2: index, %arg3: index) -> (f32, memref<3xf32>) {
 456   %c2 = arith.constant 2 : index
 457   %c1 = arith.constant 1 : index
 458   %c0 = arith.constant 0 : index
 459   %alloc = memref.alloc() {alignment = 64 : i64} : memref<3xf32>
 460   memref.store %arg0, %alloc[%c0] : memref<3xf32>
 461   memref.store %arg0, %alloc[%c1] : memref<3xf32>
 462   memref.store %arg0, %alloc[%c2] : memref<3xf32>
 463   %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<3xf32>
 464   memref.copy %alloc, %alloc_0 : memref<3xf32> to memref<3xf32>
 465   memref.store %arg1, %alloc_0[%arg2] : memref<3xf32>
 466   %0 = memref.load %alloc[%arg3] : memref<3xf32>
 467   return %0, %alloc_0 : f32, memref<3xf32>
 468 }
 469 ```
 470
 471 To get a better understanding of the SSA Use-Def Chain Analysis and the RaW
 472 conflict detection algorithm, interested users may want to refer to:
 473
 474 * [Original design document](https://discourse.llvm.org/uploads/short-url/5kckJ3DftYwQokG252teFgw3sYa.pdf)
 475 * [ODM talk](https://youtu.be/TXEo59CYS9A), ([slides](https://mlir.llvm.org/OpenMeetings/2022-01-13-One-Shot-Bufferization.pdf)).
 476 * [LLVM Dev Meeting 2023 tutorial slides](https://m-sp.org/downloads/llvm_dev_2023.pdf)