mlir/docs/Rationale.md

   1 # MLIR Rationale
   2
   3 This document is intended to capture some of the alternatives considered and
   4 open debates in the design of MLIR, along with the rationale for certain
   5 decisions we made. This is not intended to be a "finely groomed" document - we
   6 prefer the ability to dump in interesting tidbits without worrying too much
   7 about their consistency or readability.
   8
   9 [TOC]
  10
  11 ## Abstract
  12
  13 MLIR is a compiler intermediate representation with similarities to traditional
  14 three-address SSA representations (like
  15 [LLVM IR](http://llvm.org/docs/LangRef.html) or
  16 [SIL](https://github.com/apple/swift/blob/master/docs/SIL.rst)), but which
  17 introduces notions from the polyhedral loop optimization works as first class
  18 concepts. This hybrid design is optimized to represent, analyze, and transform
  19 high level dataflow graphs as well as target-specific code generated for high
  20 performance data parallel systems. Beyond its representational capabilities, its
  21 single continuous design provides a framework to lower from dataflow graphs to
  22 high performance target specific code.
  23
  24 MLIR stands for one of "Multi-Level IR" or "Multi-dimensional Loop IR" or
  25 "Machine Learning IR" or "Mid Level IR", we prefer the first. This document only
  26 provides the rationale behind MLIR -- its actual
  27 [specification document](LangRef.md) and other content is hosted elsewhere.
  28
  29 ## Introduction and Motivation
  30
  31 The Multi-Level Intermediate Representation (MLIR) is intended for easy
  32 expression and optimization of computations involving deep loop nests and dense
  33 matrices of high dimensionality. It is thus well-suited to deep learning
  34 computations in particular. Yet it is general enough to also represent arbitrary
  35 sequential computation. The representation allows high-level optimization and
  36 parallelization for a wide range of parallel architectures including those with
  37 deep memory hierarchies --- general-purpose multicores, GPUs, and specialized
  38 neural network accelerators.
  39
  40 MLIR uses ideas drawn from IRs of LLVM and Swift for lower level constructs
  41 while combining them with ideas from the polyhedral abstraction to represent
  42 loop nests, multidimensional data (tensors), and transformations on these
  43 entities as first class concepts in the IR.
  44
  45 MLIR is a multi-level IR, i.e., it represents code at a domain-specific
  46 representation such as HLO or TensorFlow graphs, all the way down to the machine
  47 level. MLIR is able to represent arbitrary control flow and arbitrary data
  48 accesses, and is general enough to represent nearly all sequential computation.
  49 This is a key distinction from existing polyhedral representation
  50 implementations (such as LLVM [Polly](https://polly.llvm.org/)) that are able to
  51 use the polyhedral abstraction in a way isolated from the LLVM IR and only for
  52 affine loop nests, i.e., portions of the code where array accesses, loop bounds,
  53 and conditionals are regular (involve linear functions of loop iterators and
  54 constant symbols). The presence of statically unpredictable data accesses or
  55 control flow does not preclude representation in MLIR, but only limits to a
  56 certain extent the ability to reason about and apply transformations using the
  57 polyhedral abstraction.
  58
  59 Maps, sets, and relations with affine constraints are the core structures
  60 underlying a polyhedral representation of high-dimensional loop nests and
  61 multidimensional arrays. These structures are represented as textual
  62 expressions in a form close to their mathematical form. These structures are
  63 used to capture loop nests, tensor data structures, and how they are reordered
  64 and mapped for a target architecture. All structured or "conforming" loops are
  65 captured as part of the polyhedral information, and so are tensor variables,
  66 their layouts, and subscripted accesses to these tensors in memory.
  67
  68 The information captured in the IR allows a compact expression of all loop
  69 transformations, data remappings, explicit copying necessary for explicitly
  70 addressed memory in accelerators, mapping to pre-tuned expert written
  71 primitives, and mapping to specialized vector instructions. Loop transformations
  72 that can be easily implemented include the body of affine transformations: these
  73 subsume all traditional loop transformations (unimodular and non-unimodular)
  74 such as loop tiling, interchange, permutation, skewing, scaling, relative
  75 shifting, reversal, fusion, and distribution/fission. Transformations on data
  76 layout such as padding and transforming to blocked layouts are also represented
  77 well via affine layout maps.
  78
  79 MLIR's design allows a progressive lowering to target-specific forms. Besides
  80 high-level transformations for loop nests and data layouts that a typical
  81 mid-level optimizer is expected to deal with, MLIR is also designed to perform
  82 certain low-level scheduling and mapping decisions that a typical backend IR is
  83 entrusted with: these include mapping to specialized vector instructions,
  84 auto-vectorization, and software pipelining. The need to support these
  85 transformations stems from the fact that neural network accelerators have
  86 specialized units that deal with large chunks of data whose computation maps
  87 back to chunks of more than one loop of the loop nests as viewed by a program at
  88 a level closer to the original specification. Such specialized units or
  89 instructions operate on multidimensional data chunks from a programmer's
  90 viewpoint. It thus makes it hard or infeasible for a backend operating on a very
  91 low-level IR close to assembly to lift and reconstruct loops and perform such a
  92 mapping. This is in contrast to classic instruction selection and scheduling in
  93 today's compilers that primarily only deals with the body of the innermost loop.
  94 MLIR also facilitates automatic mapping to expert pre-tuned primitives or vendor
  95 libraries operating on data at higher levels (or at the highest level) of the
  96 memory hierarchy.
  97
  98 In summary, MLIR is convenient for and closed under the kind of transformations
  99 needed to lower to general-purpose as well as specialized accelerators. It also
 100 allows one to build modular and reusable target independent and target dependent
 101 passes.
 102
 103 ## Design Decisions
 104
 105 This section sheds light on some of the design decisions -- some of these are
 106 indirectly implied by the specification document.
 107
 108 ### Loads and stores
 109
 110 The 'load' and 'store' instructions are specifically crafted to fully resolve to
 111 an element of a memref. These instructions take as arguments n+1 indices for an
 112 n-ranked tensor. This disallows the equivalent of pointer arithmetic or the
 113 ability to index into the same memref in other ways (something which C arrays
 114 allow for example). Furthermore, for the affine constructs, the compiler can
 115 follow use-def chains (e.g. through
 116 [affine.apply operations](Dialects/Affine.md#affineapply-operation)) or through
 117 the map attributes of [affine operations](Dialects/Affine.md#Operations)) to
 118 precisely analyze references at compile-time using polyhedral techniques. This
 119 is possible because of the [restrictions on dimensions and symbols](Dialects/Affine.md#restrictions-on-dimensions-and-symbols).
 120
 121 A scalar of element-type (a primitive type or a vector type) that is stored in
 122 memory is modeled as a 0-d memref. This is also necessary for scalars that are
 123 live out of for loops and if conditionals in a function, for which we don't yet
 124 have an SSA representation --
 125 [an extension](#mlfunction-extensions-for-"escaping-scalars") to allow that is
 126 described later in this doc.
 127
 128 ### Symbols and types
 129
 130 The current MLIR disallows use of symbols in types. For example, when a tensor
 131 or memref dimension is statically unknown, it is denoted in the type as '?'. An
 132 SSA symbol is then bound to it when a memref is created. The actual value of the
 133 unknown dimension can be queried using the "dim" builtin as shown below.
 134
 135 Example:
 136
 137 ```mlir
 138 func foo(...) {
 139   %A = alloc <8x?xf32, #lmap> (%N)
 140   ...
 141   call bar(%A) : (memref<8x?xf32, #lmap>)
 142 }
 143
 144 func bar(%A : memref<8x?xf32, #lmap>) {
 145   // Type of %A indicates that %A has dynamic shape with 8 rows
 146   // and unknown number of columns. The number of columns is queried
 147   // dynamically using dim instruction.
 148   %N = dim %A, 1 : memref<8x?xf32, #lmap>
 149
 150   affine.for %i = 0 to 8 {
 151     affine.for %j = 0 to %N {
 152       // A[i,j] += 1
 153       %s1 = affine.load %A[%i, %j] : memref<8x?xf32, #lmap>
 154       %s2 = add %s1, 1
 155       affine.store %s2, %A[%i, %j] : memref<8x?xf32, #lmap>
 156     }
 157   }
 158   return
 159 }
 160
 161 ```
 162
 163 An alternative design is to embed the reference to symbols directly in the
 164 type - memref<8x%Nxf32>. We went for the current approach in MLIR because it
 165 simplifies the design --- types remain immutable when the values of symbols
 166 change.
 167
 168 ### Block Arguments vs PHI nodes
 169
 170 MLIR Regions represent SSA using "[block arguments](LangRef.md#blocks)" rather
 171 than [PHI instructions](http://llvm.org/docs/LangRef.html#i-phi) used in LLVM.
 172 This choice is representationally identical (the same constructs can be
 173 represented in either form) but block arguments have several advantages:
 174
 175 1.  LLVM PHI nodes always have to be kept at the top of a block, and
 176     transformations frequently have to manually skip over them. This is defined
 177     away with BB arguments.
 178 1.  LLVM has a separate function Argument node. This is defined away with BB
 179     arguments, because the arguments to the entry block serve this purpose.
 180 1.  Blocks of PHI nodes in LLVM execute atomically, which is surprising and
 181     super confusing to compiler engineers and it is easy to introduce bugs with
 182     this (very related to the
 183     "[lost copy](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.524.5461&rep=rep1&type=pdf)"
 184     problem in SSA lowering literature.) With the BB argument representation,
 185     this confusion is defined away.
 186 1.  The entry list of PHI nodes in LLVM are unordered, and some blocks have
 187     thousands of predecessors (e.g. unwind blocks). This can cause long compile
 188     time problems because transformations have to linearly scan this list. This
 189     is defined away with BB argument representation.
 190 1.  LLVM has no way to represent values that are available only in one successor
 191     but not the other, e.g. its invoke instruction cannot produce the exception
 192     value JUST on the exception edge. Instead, the
 193     [landingpad instruction](http://llvm.org/docs/LangRef.html#landingpad-instruction)
 194     is a hack used to represent this. MLIR doesn't make use of this capability,
 195     but SIL uses it extensively, e.g. in the
 196     [switch_enum instruction](https://github.com/apple/swift/blob/master/docs/SIL.rst#switch-enum).
 197
 198 For more context, block arguments were previously used in the Swift
 199 [SIL Intermediate Representation](https://github.com/apple/swift/blob/master/docs/SIL.rst),
 200 and described in
 201 [a talk on YouTube](https://www.youtube.com/watch?v=Ntj8ab-5cvE). The section of
 202 interest
 203 [starts here](https://www.google.com/url?q=https://youtu.be/Ntj8ab-5cvE?t%3D596&sa=D&ust=1529450150971000&usg=AFQjCNFQHEWL7m8q3eO-1DiKw9zqC2v24Q).
 204
 205 ### Index type disallowed in vector/tensor/memref types
 206
 207 Index types are not allowed as elements of `vector`, `tensor` or `memref` type.
 208 Index types are intended to be used for platform-specific "size" values and may
 209 appear in subscripts, sizes of aggregate types and affine expressions. They are
 210 also tightly coupled with `affine.apply` and affine.load/store operations;
 211 having `index` type is a necessary precondition of a value to be acceptable by
 212 these operations. While it may be useful to have `memref<?xindex>` to express
 213 indirect accesses, e.g. sparse matrix manipulations or lookup tables, it creates
 214 problems MLIR is not ready to address yet. MLIR needs to internally store
 215 constants of aggregate types and emit code operating on values of those types,
 216 which are subject to target-specific size and alignment constraints.  Since MLIR
 217 does not have a target description mechanism at the moment, it cannot reliably
 218 emit such code. Moreover, some platforms may not support vectors of type
 219 equivalent to `index`.
 220
 221 Indirect access use cases can be alternatively supported by providing and
 222 `index_cast` instruction that allows for conversion between `index` and
 223 fixed-width integer types, at the SSA value level. It has an additional benefit
 224 of supporting smaller integer types, e.g. `i8` or `i16`, for small indices
 225 instead of (presumably larger) `index` type.
 226
 227 ### Bit width of a non-primitive types and `index` is undefined
 228
 229 The bit width of a compound type is not defined by MLIR, it may be defined by a
 230 specific lowering pass. In MLIR, bit width is a property of certain primitive
 231 _type_, in particular integers and floats. It is equal to the number that
 232 appears in the type definition, e.g. the bit width of `i32` is `32`, so is the
 233 bit width of `f32`. The bit width is not _necessarily_ related to the amount of
 234 memory (in bytes) or the size of register (in bits) that is necessary to store
 235 the value of the given type. These quantities are target and ABI-specific and
 236 should be defined during the lowering process rather than imposed from above.
 237 For example, `vector<3xi57>` is likely to be lowered to a vector of four 64-bit
 238 integers, so that its storage requirement is `4 x 64 / 8 = 32` bytes, rather
 239 than `(3 x 57) ceildiv 8 = 22` bytes as can be naively computed from the
 240 bitwidth. Individual components of MLIR that allocate space for storing values
 241 may use the bit size as the baseline and query the target description when it is
 242 introduced.
 243
 244 The bit width is not defined for dialect-specific types at MLIR level. Dialects
 245 are free to define their own quantities for type sizes.
 246
 247 ### Signless types
 248
 249 Integers in the builtin MLIR type system have a bitwidth (note that the `index`
 250 type has a symbolic width equal to the machine word size), but they do not have
 251 an intrinsic sign. This means that the "standard ops" operation set has things
 252 like `addi` and `muli` which do two's complement arithmetic, but some other
 253 operations get a sign, e.g. `divis` vs `diviu`.
 254
 255 LLVM uses the [same design](http://llvm.org/docs/LangRef.html#integer-type),
 256 which was introduced in a revamp rolled out
 257 [in the LLVM 2.0 integer type](http://releases.llvm.org/2.0/docs/LangRef.html#t_derived).
 258 Prior to that, from
 259 [LLVM 1.0](http://releases.llvm.org/1.0/docs/LangRef.html#t_classifications) to
 260 [1.9](http://releases.llvm.org/1.9/docs/LangRef.html#t_classifications), LLVM
 261 uses signed types like "sbyte" and "ubyte". This shift was important and has
 262 served LLVM well over the years. The reason this is important is that it is a
 263 good thing for an intermediate representation to represent the same computation
 264 with the same instruction. Signed types got in the way, because (e.g.) an "add
 265 of an sbyte" does the same computation as an "add of a ubyte", but the type
 266 system made them look artificially different. This split also required casts
 267 like "cast from sbyte to ubyte" which do nothing at the machine level. Removing
 268 signs from the type system eliminated these problems, making the compiler
 269 simpler.
 270
 271 More information about this split is available in an old
 272 [talk on youtube](https://www.youtube.com/watch?v=VeRaLPupGks) talking about
 273 LLVM 2.0.
 274
 275 Note that this rationale only applies to the "standard ops" dialect in which we
 276 can express an opinion about its design. Other dialects generally try to model
 277 an external system, and should aim to reflect its design as closely as possible.
 278
 279 ### Splitting floating point vs integer operations
 280
 281 The MLIR "standard" operation set splits many integer and floating point
 282 operations into different categories, for example `addf` vs `addi` and `cmpf` vs
 283 `cmpi`
 284 ([following the design of LLVM](http://llvm.org/docs/LangRef.html#binary-operations)).
 285 These instructions _are_ polymorphic on the number of elements in the type
 286 though, for example `addf` is used with scalar floats, vectors of floats, and
 287 tensors of floats (LLVM does the same thing with its scalar/vector types).
 288
 289 This split is important because floating point and integer operations are quite
 290 different in practice: for example, floating point values include NaN's, so
 291 [integer comparisons](http://llvm.org/docs/LangRef.html#icmp-instruction) and
 292 [floating point comparisons](http://llvm.org/docs/LangRef.html#fcmp-instruction)
 293 should use different comparison opcodes. On the arithmetic side of things,
 294 floating point operations support rounding modes, floating point contractions,
 295 ["fast math"](http://llvm.org/docs/LangRef.html#fadd-instruction), and integers
 296 may want to have two's complement overflow behavior or be undefined on
 297 [various forms of wrapping](http://llvm.org/docs/LangRef.html#add-instruction)
 298 for performance.
 299
 300 We are a long way from this sort of thing being a priority to care about in
 301 MLIR, but since we have experience and know the right way to do this, we'd
 302 rather design it in from the beginning.
 303
 304 Note that this rationale only applies to the "standard ops" dialect in which we
 305 can express an opinion about its design. Other dialects generally try to model
 306 an external system, and should aim to reflect its design as closely as possible.
 307
 308 ### Specifying sign in integer comparison operations
 309
 310 Since integers are [signless](#signless-types), it is necessary to define the
 311 sign for integer comparison operations. This sign indicates how to treat the
 312 foremost bit of the integer: as sign bit or as most significant bit. For
 313 example, comparing two `i4` values `0b1000` and `0b0010` yields different
 314 results for unsigned (`8 > 3`) and signed (`-8 < 3`) interpretations. This
 315 difference is only significant for _order_ comparisons, but not for _equality_
 316 comparisons. Indeed, for the latter all bits must have the same value
 317 independently of the sign. Since both arguments have exactly the same bit width
 318 and cannot be padded by this operation, it is impossible to compare two values
 319 whose bit representations would differ while the values are interpreted as
 320 equal.
 321
 322 ### Specifying comparison kind as attribute
 323
 324 Unlike arithmetic, comparison operators share several common properties, e.g.
 325 they cannot be considered associative. In practice, comparisons are sometimes
 326 implemented by the same instruction or its variants so it makes sense to group
 327 them together at the IR level.
 328
 329 An alternative would be introducing ten distinct operators for all currently
 330 supported kinds of integer comparisons. These operators would have increased the
 331 number of "reserved" names used by standard operations as well as the size of
 332 the C++ API while their implementations would have been mostly identical.
 333
 334 The comparison kind is internally an integer attribute. However, for the sake of
 335 readability by humans, custom assembly form accepts string literals that are
 336 mapped to the underlying integer values: `cmpi "eq", %lhs, %rhs` better implies
 337 integer equality comparison than `cmpi 0, %lhs, %rhs` where it is unclear what
 338 gets compared to what else. This syntactic sugar is possible thanks to parser
 339 logic redefinitions for custom assembly form of non-builtin operations.
 340 Supporting it in the full notation would have required changing how the main
 341 parsing algorithm works and may have unexpected repercussions. While it had been
 342 possible to store the predicate as string attribute, it would have rendered
 343 impossible to implement switching logic based on the comparison kind and made
 344 attribute validity checks (one out of ten possible kinds) more complex.
 345
 346 ### 'select' operation to implement min/max
 347
 348 Although `min` and `max` operations are likely to occur as a result of
 349 transforming affine loops in ML functions, we did not make them first-class
 350 operations. Instead, we provide the `select` operation that can be combined with
 351 `cmpi` to implement the minimum and maximum computation. Although they now
 352 require two operations, they are likely to be emitted automatically during the
 353 transformation inside MLIR. On the other hand, there are multiple benefits of
 354 introducing `select`: standalone min/max would concern themselves with the
 355 signedness of the comparison, already taken into account by `cmpi`; `select` can
 356 support floats transparently if used after a float-comparison operation; the
 357 lower-level targets provide `select`-like instructions making the translation
 358 trivial.
 359
 360 This operation could have been implemented with additional control flow: `%r =
 361 select %cond, %t, %f` is equivalent to
 362
 363 ```mlir
 364 ^bb0:
 365   cond_br %cond, ^bb1(%t), ^bb1(%f)
 366 ^bb1(%r):
 367 ```
 368
 369 However, this control flow granularity is not available in the ML functions
 370 where min/max, and thus `select`, are likely to appear. In addition, simpler
 371 control flow may be beneficial for optimization in general.
 372
 373 ### Regions
 374
 375 #### Attributes of type 'Block'
 376
 377 We considered representing regions through `ArrayAttr`s containing a list of a
 378 special type `IRBlockAttr`, which in turn would contain a list of operations.
 379 All attributes in MLIR are unique’d within the context, which would make the IR
 380 inside the regions immortal for no good reason.
 381
 382 #### Use "inlined" functions as regions
 383
 384 We considered attaching a "force-inline" attribute on a function and/or a
 385 function `call` operation. Even the minimal region support (use cases in
 386 affine.for and affine.if existing before the regions) requires access to the
 387 values defined in the dominating block, which is not supported by functions.
 388 Conceptually, function bodies are instances of regions rather than the inverse;
 389 regions can also be device kernels, alternative sections, etc.
 390
 391 #### Dedicated `region` operation
 392
 393 This would mean we have a special kind of operation that is allowed to have
 394 regions while other operations are not. Such distinction is similar to the
 395 Stmt/Op difference we have had and chose to remove to make the IR simpler and
 396 more flexible. It would also require analyses and passes to consider the
 397 interplay between operations (e.g., an `affine.for` operation must be followed
 398 by a region operation). Finally, a region operation can be introduced using the
 399 current implementation, among other operations and without being special in any
 400 sense.
 401
 402 #### Explicit capture of the values used in a region
 403
 404 Being able to use values defined outside the region implies that use-def chains
 405 may contain uses from different nested regions. Consequently, IR transformations
 406 and analyses can pull the instruction defining the value across region
 407 boundaries, for example in case of TableGen-defined canonicalization patterns.
 408 This would not be the case if all used values had been passed as region
 409 arguments. One of the motivations for introducing regions in the IR is precisely
 410 to enable cross-region analyses and transformations that are simpler than
 411 inter-procedural transformations. Having uses from different regions appear in
 412 the same use-def chain, contrary to an additional data structure maintaining
 413 correspondence between function call arguments as uses of the original
 414 definitions and formal arguments as new definitions, enables such
 415 simplification. Since individual operations now belong to blocks, which belong
 416 to regions, it is always possible to check if the definition of the value
 417 belongs to the same region as its particular use. The risk is that any IR
 418 traversal will need to handle explicitly this situation and it is easy to forget
 419 a check (or conversely it isn’t easy to design the right check in a tablegen
 420 pattern for example): traversing use-def chains potentially crosses implicitly
 421 semantic barriers, making it possible to unknowingly break region semantics.
 422 This is expected to be caught in the verifier after the transformation.
 423
 424 At the same time, one may choose to pass certain or all values as region
 425 arguments to explicitly break the use-def chains in the current proposal. This
 426 can be combined with an attribute-imposed semantic requirement disallowing the
 427 body of the region to refer to any value from outside it.
 428
 429 ### Quantized integer operations
 430
 431 We haven't designed integer quantized operations in MLIR, but experience from
 432 TensorFlow suggests that it is better to put information about the quantization
 433 range/scale into the type itself, rather than have a single type like "qint8"
 434 and put these on attributes of the operation.
 435
 436 There are a few ways to do this with MLIR, including at least:
 437
 438 *   We could do the same thing TensorFlow does - and we will _have_ to support
 439     that model to some extent for compatibility.
 440 *   We can encode the fp range of quantized integers directly into the types
 441     when they are constants. The best practice on this seems to be to encode the
 442     zero point as well as a scale factor. This ensures that 0.0 is always
 443     exactly representable, e.g. `qi8<-1.42, 31.23x>`.
 444 *   We could theoretically encode dynamically determined ranges into the types
 445     using something like `qi8<?,?>` with the bounds being determined through the
 446     SSA dataflow graph dynamically - similar to how dynamic shapes are handled.
 447
 448 We will definitely need to do #1 for compatibility, we probably want to do #2,
 449 and we should investigate #3 over time. That said, our short term plan is to get
 450 more implementation experience with the rest of the system first, then come back
 451 to re-examine the representation for quantized arithmetic when we have that
 452 experience. When we do, we should chat with benoitjacob@ and
 453 [read the paper](https://arxiv.org/abs/1712.05877).
 454
 455 ### Dialect type extensions
 456
 457 This section describes the design decisions that shaped the dialect extensible
 458 type system present in MLIR.
 459
 460 #### Reserving dialect type kinds
 461
 462 Dialects that wish to define type extensions must reserve a range of type kinds
 463 within a '.def' file within the core IR library. This means that every dialect
 464 wishing to define custom types must modify this file, but it guarantees that all
 465 type casting checkings are performed in O(1) time.
 466
 467 #### Interactions between dialects
 468
 469 There are two different interactions between dialects that are important to
 470 understand. When types of a dialect are:
 471
 472 *   In operations of other dialects
 473
 474     -   For standard/builtin operations, only standard/builtin types are
 475         allowed. This restriction allows for operations to clearly understand
 476         the invariants that they are working under.
 477     -   Outside of standard/builtin operations, dialects are expected to verify
 478         the allowable operation types per operation.
 479
 480 *   In types of other dialects
 481
 482     -   For standard/builtin types, these types are allowed to contain types
 483         from other dialects. This simplifies the type system and removes the
 484         need for dialects to redefine all of the standard aggregate types, e.g.
 485         tensor, as well as the memref type. Dialects are expected to verify that
 486         a specific type is valid within a standard type, e.g. if a type can be
 487         an element of a tensor.
 488     -   For dialect types, the dialect is expected to verify any type
 489         invariants, e.g. if the standard tensor type can contain a specific type
 490         of that dialect.
 491
 492 #### Separating builtin and standard types
 493
 494 Following the separation between the built-in and standard dialect, it makes
 495 sense to separate built-in types and standard dialect types. Built-in types are
 496 required for the validity of the IR itself, e.g. the function type (which
 497 appears in function signatures and generic assembly forms of operations).
 498 Integer, float, vector, memref and tensor types, while important, are not
 499 necessary for IR validity.
 500
 501 #### Unregistered types
 502
 503 MLIR supports unregistered operations in generic assembly form. MLIR also
 504 supports a similar concept for types. When parsing, if the dialect for dialect
 505 type has not been registered the type is modeled as an 'OpaqueType'. This allows
 506 for types to be round-tripped without needing to link in the dialect library
 507 that defined them. No additional information about opaque types, outside of
 508 parsing/printing, will be available.
 509
 510 #### Dialect type syntax
 511
 512 Dialect extended types are represented as string literals wrapped inside of the
 513 dialect namespace. This means that the parser delegates to the dialect for
 514 parsing specific type instances. This differs from the representation of dialect
 515 defined operations, of which have an identifier name that the parser uses to
 516 identify and parse them.
 517
 518 This representation was chosen for several reasons:
 519
 520 ##### Dialects must provide custom type parsers
 521
 522 Dialect type parsing cannot plug into the existing parser infrastructure as
 523 operations do with the OpAsmParser/Printer. Operations have a defined syntax
 524 structure that is the same across all dialects. Types, on the other hand, may
 525 have many different, and sometimes conflicting, parsing constraints that would
 526 be difficult/unmaintainable to provide within a single interface.
 527
 528 This also has the added benefit of encouraging dialects to reuse existing
 529 external type parsers. For example, an LLVM dialect may provide an MLIR LLVM
 530 type that is simply a wrapper around LLVM types. The LLVM dialect would then use
 531 the existing LLVM type parsing infrastructure.
 532
 533 Example:
 534
 535 ```mlir
 536 %s = "foo"() : () -> !llvm<"i32*">
 537 ```
 538
 539 ##### Types do not always have canonical names
 540
 541 Unlike operations, types generally do not have a formal canonical name. For
 542 example, function types have no defined keyword and integer types are defined by
 543 a regular expression to support arbitrary bitwidth. Dialects with existing type
 544 systems, e.g. LLVM, are likely to provide wrappers around their existing type
 545 systems. For these wrapper types there is no simple canonical name, it's logical
 546 to think of these types as existing within the namespace of the dialect. If a
 547 dialect wishes to assign a canonical name to a type, it can be done via
 548 [type aliases](LangRef.md#type-aliases).
 549
 550 ### Tuple types
 551
 552 The MLIR type system provides first class support for defining
 553 [tuple types](LangRef.md#tuple-type). This is due to the fact that `Tuple`
 554 represents a universal concept that is likely to, and has already begun to,
 555 present itself in many different dialects. Though this type is first class in
 556 the type system, it merely serves to provide a common mechanism in which to
 557 represent this concept in MLIR. As such, MLIR provides no standard operations
 558 for interfacing with `tuple` types. It is up to dialect authors to provide
 559 operations, e.g. extract_tuple_element, to interpret and manipulate them. When
 560 possible, operations should prefer to use multiple results instead. These
 561 provide a myriad of benefits, such as alleviating any need for tuple-extract
 562 operations that merely get in the way of analysis and transformation.
 563
 564 ### Assembly forms
 565
 566 MLIR decides to support both generic and custom assembly forms under the
 567 following considerations:
 568
 569 MLIR is an open system; it is designed to support modular and pluggable
 570 dialects. Depending on whether there exists a corresponding dialect and whether
 571 the dialect is plugged in, operations may or may not be registered into MLIR
 572 system. Yet we still need a way to investigate these operations. So the generic
 573 assembly form is mandated by this aspect of MLIR system. It provides a default
 574 textual form for operations.
 575
 576 On the other hand, an assembly form is for assisting developers to investigate
 577 the IR. The generic form serves as a safe fallback but it can be too verbose for
 578 certain ops. Therefore, MLIR gives each dialect the choice to define a custom
 579 assembly form for each operation according to the operation's semantics and
 580 specific needs. The custom assembly form can de-duplicate information from the
 581 operation to derive a more concise form, thus better facilitating the
 582 comprehension of the IR.
 583
 584 ## Examples
 585
 586 This section describes a few very simple examples that help understand how MLIR
 587 represents computation.
 588
 589 ### Non-affine control flow
 590
 591 ```mlir
 592 // A simple linear search in every row of a matrix
 593 for (i = 0; i < N; i++) {
 594   for (j = 0; j < N; j++) {
 595     // dynamic control flow
 596     if (a[i][j] == key) {
 597       s[i] = j;
 598       break;
 599     }
 600   }
 601 }
 602 ```
 603
 604 The presence of dynamic control flow leads to an inner non-affine function
 605 nested in an outer function that using affine loops.
 606
 607 ```mlir
 608 func @search(%A: memref<?x?xi32, %S: <?xi32>, %key : i32) {
 609   %ni = dim %A, 0 : memref<?x?xi32>
 610   // This loop can be parallelized
 611   affine.for %i = 0 to %ni {
 612     call @search_body (%A, %S, %key, %i) : (memref<?x?xi32>, memref<?xi32>, i32, i32)
 613   }
 614   return
 615 }
 616
 617 func @search_body(%A: memref<?x?xi32>, %S: memref<?xi32>, %key: i32, %i : i32) {
 618   %nj = dim %A, 1 : memref<?x?xi32>
 619   br ^bb1(0)
 620
 621 ^bb1(%j: i32)
 622   %p1 = cmpi "lt", %j, %nj : i32
 623   cond_br %p1, ^bb2, ^bb5
 624
 625 ^bb2:
 626   %v = affine.load %A[%i, %j] : memref<?x?xi32>
 627   %p2 = cmpi "eq", %v, %key : i32
 628   cond_br %p2, ^bb3(%j), ^bb4
 629
 630 ^bb3(%j: i32)
 631   affine.store %j, %S[%i] : memref<?xi32>
 632   br ^bb5
 633
 634 ^bb4:
 635   %jinc = addi %j, 1 : i32
 636   br ^bb1(%jinc)
 637
 638 ^bb5:
 639   return
 640 }
 641 ```
 642
 643 As per the [MLIR spec](LangRef.md), the restrictions on dimensions and symbol
 644 identifiers to be used with the affine.apply operation only apply to accesses
 645 inside `affine.for` and `affine.if` operations. However, an analysis of accesses
 646 inside the called function (`@search_body`) is necessary to determine if the
 647 `%i` loop could be parallelized: such function access analysis is calling
 648 context sensitive.
 649
 650 ### Non-affine loop bounds
 651
 652 Loop bounds that are not affine lead to a nesting of functions as shown below.
 653
 654 ```c
 655 for (i = 0; i < N; i++)
 656   for (j = 0; j < N; j++)
 657     // Non-affine loop bound for k loop.
 658     for (k = 0; k < pow(2, j); k++)
 659        for (l = 0; l < N; l++) {
 660         // block loop body
 661         ...
 662        }
 663 ```
 664
 665 ```mlir
 666 func @outer_nest(%n : index) {
 667   affine.for %i = 0 to %n {
 668     affine.for %j = 0 to %n {
 669       %pow = call @pow(2, %j) : (index, index) ->  index
 670       call @inner_nest(%pow, %n) : ...
 671     }
 672   }
 673   return
 674 }
 675
 676 func @inner_nest(%m : index, %n : index) {
 677   affine.for %k = 0 to %m {
 678     affine.for %l = 0 to %n {
 679       ...
 680     }
 681   }
 682   return
 683 }
 684 ```
 685
 686 ### Reference 2D Convolution
 687
 688 The following example illustrates a reference implementation of a 2D
 689 convolution, which uses an integer set `#domain` to represent valid input data
 690 in a dilated convolution.
 691
 692 ```mlir
 693 // Dilation factors S0 and S1 can be constant folded if constant at compile time.
 694 #domain = (d0, d1)[S0,S1,S2,S3]: (d0 % S0 == 0, d1 % S1 == 0, d0 >= 0, d1 >= 0,
 695                                    S3 - d0 - 1 >= 0, S4 - d1 - 1 >= 0)
 696 // Identity map (shown here for illustration).
 697 #map0 = (d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2, d3, d4, d5, d6)
 698
 699 // Affine map from output to input coordinate space.
 700 // d0 = output_h, d1 = output_w, d2 = kernel_h, d3 = kernel_w
 701 // S0 = h_stride, S1 = w_stride, S2 = h_kernel_dilation, S3 = w_kernel_dilation
 702 // S4 = h_pad_low, S5 = w_pad_low
 703 //     %out0 =  %0#1 * %h_stride + %0#4 * %h_kernel_dilation - %h_pad_low
 704 //     %out1=  %0#2 * %w_stride + %0#5 * %w_kernel_dilation - %w_pad_low
 705 #map1_0 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d0 * S0 + d2 * S2 - %S4)
 706 #map1_1 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d1 * S1 + d3 * S3 - %S5)
 707
 708 // Semi-affine map to undilated input coordinate space.
 709 // d0 = input_h, d1 = input_w, S0 = h_base_dilation, S1 = w_base_dilation.
 710 #map2_0 = (d0, d1) [S0, S1] -> (d0 / S0)
 711 #map2_1 = (d0, d1) [S0, S1] -> (d1 / S1)
 712
 713 // Conv2D shapes:
 714 // input:   [batch, input_height, input_width, input_feature]
 715 // kernel: [kernel_height, kernel_width, input_feature, output_feature]
 716 // output: [batch, output_height, output_width, output_feature]
 717 func @conv2d(%input: memref<16x1024x1024x3xf32, #lm0, /*scratchpad=*/1>,
 718              %kernel: memref<5x5x3x32xf32, #lm0, /*scratchpad=*/1>,
 719              %output: memref<16x512x512x32xf32, #lm0, /*scratchpad=*/1>) {
 720   affine.for %b = 0 to %batch {
 721     affine.for %oh = 0 to %output_height {
 722       affine.for %ow = 0 to %output_width {
 723         affine.for %of = 0 to %output_feature {
 724           affine.for %kh = 0 to %kernel_height {
 725             affine.for %kw = 0 to %kernel_width {
 726               affine.for %if = 0 to %input_feature {
 727                 // Calculate input indices.
 728                 %1_0 = affine.apply #map1_0 (%0#1, %0#2, %0#4, %0#5)
 729                   [%h_stride, %w_stride, %h_kernel_dilation, %w_kernel_dilation,
 730                    %h_pad_low, %w_pad_low]
 731                 %1_1 = affine.apply #map1_1 (%0#1, %0#2, %0#4, %0#5)
 732                   [%h_stride, %w_stride, %h_kernel_dilation, %w_kernel_dilation,
 733                    %h_pad_low, %w_pad_low]
 734
 735                 // Check if access is not in padding.
 736                 affine.if #domain(%1_0, %1_1)
 737                                        [%h_base_dilation, %w_kernel_dilation, %h_bound, %w_bound] {
 738                   %2_0 = affine.apply #map2 (%1_0, %1_1)
 739                   %2_1 = affine.apply #map2 (%1_0, %1_1)
 740                   // Compute: output[output_indices] += input[input_indices] * kernel[kernel_indices]
 741                   call @multiply_accumulate(%input, %kernel, %output, %b, %oh, %ow, %of, %kh, %kw, %if, %2_0, %2_1)
 742                 }
 743               }
 744             }
 745           }
 746         }
 747       }
 748     }
 749   }
 750   return
 751 }
 752 ```
 753
 754 TODO (Add more examples showing the IR for a variety of interesting cases)
 755
 756 ## Design alternatives and extensions
 757
 758 This is a list of some design alternatives and extensions that we discussed in
 759 detail but did not include in the spec or postponed them for future
 760 consideration on demand. We will revisit these discussions when we have more
 761 implementation experience and learn more about the challenges and limitations of
 762 our current design in practice.
 763
 764 ### Polyhedral code representation alternatives: schedule lists vs schedules trees vs affine loop/if forms
 765
 766 The current MLIR uses a representation of polyhedral schedules using a tree of
 767 if/for loops. We extensively debated the tradeoffs involved in the typical
 768 unordered polyhedral instruction representation (where each instruction has
 769 multidimensional schedule information), discussed the benefits of schedule tree
 770 forms, and eventually decided to go with a syntactic tree of affine if/else
 771 conditionals and affine for loops. Discussion of the tradeoff was captured in
 772 this document:
 773 [ MLIR: The case for a simplified polyhedral form](RationaleSimplifiedPolyhedralForm.md).
 774
 775 At a high level, we have two alternatives here:
 776
 777 1.  Schedule tree representation instead of an affine loop AST form: The current
 778     proposal uses an affine loop and conditional tree form, which is syntactic
 779     and with no separation of domains as sets and schedules as multidimensional
 780     affine functions. A schedule tree form however makes polyhedral domains and
 781     schedules a first class concept in the IR allowing compact expression of
 782     transformations through the schedule tree without changing the domains of
 783     instructions. Such a representation also hides prologues, epilogues, partial
 784     tiles, complex loop bounds and conditionals making loop nests free of
 785     "syntax". Cost models instead look at domains and schedules. In addition, if
 786     necessary such a domain schedule representation can be normalized to
 787     explicitly propagate the schedule into domains and model all the cleanup
 788     code. An example and more detail on the schedule tree form is in the next
 789     section.
 790 1.  Having two different forms of "affine regions": an affine loop tree form
 791     and a polyhedral schedule tree form. In the latter, ops could carry
 792     attributes capturing domain, scheduling, and other polyhedral code
 793     generation options with IntegerSet, AffineMap, and other attributes.
 794
 795 #### Schedule Tree Representation for Affine Regions
 796
 797 This representation is based on a simplified form of the domain/schedule
 798 representation used by the polyhedral compiler community. Domains represent what
 799 has to be executed while schedules represent the order in which domain elements
 800 are interleaved. We model domains as non-piece-wise convex integer sets, and
 801 schedules as affine functions; however, the former can be disjunctive, and the
 802 latter can be piece-wise affine relations. In the schedule tree representation,
 803 domain and schedules for instructions are represented in a tree-like structure
 804 which is called a schedule tree. Each non-leaf node of the tree is an abstract
 805 polyhedral dimension corresponding to an abstract fused loop for each ML
 806 instruction that appears in that branch. Each leaf node is an ML Instruction.
 807
 808 ```mlir
 809 // A tiled matmul code (128x128x128) represented in schedule tree form
 810
 811 // #map0 = (d0, d1, d2, d3, d4, d5) -> (128*d0 + d3, 128*d1 + d4, 128*d2 + d5)
 812 #intset_ij = (i, j) [M, N, K]  : i >= 0, -i + N - 1 >= 0, j >= 0, -j + N-1 >= 0
 813 #intset_ijk = (i, j, k) [M, N, K] : i >= 0, -i + N - 1 >= 0, j >= 0,
 814                                      -j +  M-1 >= 0, k >= 0, -k + N - 1 >= 0)
 815 func @matmul(%A, %B, %C, %M, %N, %K) : (...)  { // %M, N, K are symbols
 816   // t1, t2, t3, t4, t5, t6  are abstract polyhedral loops
 817   mldim %t1 : {S1,S2,S3,S4,S5}  floordiv (i, 128) {
 818     mldim %t2 : {S1,S2,S3,S4,S5}  floordiv (j, 128) {
 819       // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2)
 820       call dma_mem_to_scratchpad(%C, %i, %j, %M, %N, %K)
 821           with @intset_ij(%i, %j) [%M, %N, %K]
 822       mldim %t3 :   {S2,S3,S4,S5} floordiv (k, 128) {
 823         // (%i, %j, %k) = affine.apply (d0, d1, d2)
 824         //                          -> (128*d0, 128*d1, 128*d2)  (%t1, %t2, %t3)
 825         call dma_mem_to_scratchpad(%A, ...) with #inset_ijk (%i, %j, %k) [%M, %N, %K]
 826         // (%i, %j, %k) = affine.apply (d0, d1, d2)
 827         //                          -> (128*d0, 128*d1, 128*d2)  (%t1, %t2, %t3)
 828         call dma_mem_to_scratchpad(%B, ...) with #inset_ijk (%i, %j, %k) [%M, %N, %K]
 829         mldim %t4 : {S4} i mod 128 {
 830           mldim %t5 : {S4} j mod 128 {
 831             mldim %t6 : {S4} k mod 128 {
 832               // (%i, %j, %k) = affine.apply #map0 (%t1, %t2, %t3, %t4, %t5, %t6)
 833               call matmul_body(A, B, C, %i, %j, %k, %M, %N, %K)
 834                   with #inset_ijk(%i, %j, %k) [%M, %N, %K]
 835             } // end mld4im t6
 836           } // end mldim t5
 837         } // end mldim t4
 838       } // end mldim t3
 839       // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2)
 840       call $dma_scratchpad_to_mem_C ... with #intset(%i, %j) [%M, %N, %K]
 841     }  // end mldim t2
 842   } // end mldim t1
 843   return
 844 }
 845
 846 ```
 847
 848 ### Affine Relations
 849
 850 The current MLIR spec includes affine maps and integer sets, but not affine
 851 relations. Affine relations are a natural way to model read and write access
 852 information, which can be very useful to capture the behavior of opaque external
 853 library calls, high-performance vendor libraries, or user-provided / user-tuned
 854 routines.
 855
 856 An affine relation is a relation between input and output dimension identifiers
 857 while being symbolic on a list of symbolic identifiers and with affine
 858 constraints on the identifiers.
 859
 860 Syntax:
 861
 862 ```
 863 // Affine relation definition at the top of file
 864 affine-rel-def ::= affine-rel-id `=` affine-relation-inline
 865
 866 affine-rel-id ::= `##` prefixed-id
 867
 868 affine-relation-inline ::=
 869        `(` input-dims `)` (`[` symbols `]`)? `->`
 870        `(` output-dims `)` :  affine-constraint-conjunction
 871
 872 input-dims ::= bare-id-list
 873 output-dims ::= bare-id-list
 874 symbols ::= bare-id-list
 875
 876 affine-rel ::= affine-rel-id | affine-relation-inline
 877
 878 // Usage
 879 affine-rel-spec ::= affine-rel dim-and-symbol-use-list
 880 ```
 881
 882 All identifiers appearing in input-dims, output-dims, and symbol-dims are
 883 pairwise distinct. All affine-constraint non-terminals in the above syntax are
 884 allowed to contain identifiers only from input-dims, output-dims, and
 885 symbol-dims.
 886
 887 Affine relations are used to model read, write, may_read, and may_write sets of
 888 functions in the IR. The output dimension identifiers correspond to the data
 889 dimensions.
 890
 891 Example:
 892
 893 ```mlir
 894 // read relation: two elements ( d0 <= r0 <= d0+1 )
 895 ##aff_rel9 = (d0) -> (r0) : r0 - d0 >= 0, d0 - r0 + 1 >= 0
 896
 897 func @count (%A : memref<128xf32>, %pos : i32) -> f32
 898   reads: {%A ##aff_rel9 (%pos)}
 899   writes: /* empty */
 900   may_reads: /* empty */
 901   may_writes: /* empty */ {
 902 bb0 (%0, %1: memref<128xf32>, i64):
 903   %val = affine.load %A [%pos]
 904   %val = affine.load %A [%pos + 1]
 905   %p = mulf %val, %val : f32
 906   return %p : f32
 907 }
 908 ```
 909
 910 ### Regions
 911
 912 #### Making function definition an operation
 913
 914 MLIR supports values of a Function type. Instead of having first-class IR
 915 concept for functions, one could define an operation with a body region that
 916 defines a function value. The particularity of functions is that their names are
 917 globally visible and can be referred to before being defined, unlike SSA values
 918 that must be defined first. Implementing a "function definition" operation would
 919 require to relax some of the SSA constraints in a region, and also make the IR
 920 Module a region as well. It would also affect the core infrastructure (e.g.,
 921 function passes) only for the sake of concept unification.
 922
 923 #### Having types on a region
 924
 925 Instead of inspecting the types of arguments of the first block, one could give
 926 the region itself a type. This type would be redundant with block argument
 927 types, which must have values and create room for type mismatches. While
 928 functions do have types that are partly redundant with the arguments of the
 929 first block in the function, this is necessary to support function declarations
 930 that do not have a body which we can refer to in order to obtain the argument
 931 types. A region is always contained in an operation or a function that can be
 932 queried to obtain the “type” of the region if necessary.
 933
 934 A type on a region can be justified if Regions were to be considered separately
 935 from the enclosing entity (operation or function) and had their own semantics
 936 that should be checked.
 937
 938 #### Attaching attributes to regions
 939
 940 Regions could be annotated with dialect attributes to use attribute verification
 941 hooks. An operation could take multiple regions as arguments, and each of them
 942 may require different attributes. However, there are currently very few
 943 practical cases where this would be necessary. Instead, one could simulate
 944 per-region attributes with array attributes attached to the entity containing
 945 the region (operation or function). This decreases the overall complexity of the
 946 IR and enables more concise and op-specific forms, e.g., when all regions of an
 947 op have the same attribute that can be only mentioned once. Since the semantics
 948 of the region is entirely defined by the enclosing entity, it also makes sense
 949 to have attributes attached to that entity rather than to the region itself.
 950
 951 This can be reconsidered in the future if we see a non-neglectable amount of use
 952 cases.
 953
 954 ### Read/Write/May_Read/May_Write sets for External Functions
 955
 956 Having read, write, may_read, and may_write sets for external functions which
 957 include opaque ones, high-performance vendor libraries such as CuDNN, CuB, MKL,
 958 FFT libraries, user-provided/optimized functions, or data movement runtimes such
 959 as DMA ones is a powerful feature. It allows the compiler to perform analysis,
 960 composition/transformation in the presence of such calls and with loops around
 961 such calls on sub-tensors. For user-provided or custom hand-tuned functions, the
 962 read/write/may_read/may_write sets could be provided a-priori by a user as part
 963 of the external function signature or they could be part of a database.
 964
 965 TODO: Design this, and update to use function attribute syntax.
 966
 967 Example:
 968
 969 ```mlir
 970 ##rel9 ( ) [s0] -> (r0, r1) : 0 <= r0 <= 1023, 0 <= r1 <= s0 - 1
 971
 972 func @cblas_reduce_ffi(%M: memref<1024 x ? x f32, #layout_map0, /*mem=*/0>)
 973   -> f32 [
 974   reads: {%M, ##rel9() }
 975   writes: /* empty */
 976   may_reads: /* empty */
 977   may_writes: /* empty */
 978 ]
 979
 980 func @dma_mem_to_scratchpad(%a : memref<1024 x f32, #layout_map0, /*mem=*/0>,
 981     %b : memref<1024 x f32, #layout_map0, 1>, %c : memref<1024 x f32,
 982     #layout_map0>) [
 983   reads: {%M, ##rel9() }
 984   writes: /* empty */
 985   may_reads: /* empty */
 986   may_writes: /* empty */
 987  ]
 988
 989 ```
 990
 991 ### Memref Extensions
 992
 993 1.  Arbitrary polyhedral shapes for tensors: e.g., triangular shapes in tensor
 994     dimensions where there is symmetry: use integer set (affine constraints) to
 995     model tensor data space (instead of just extents). Requires some changes to
 996     the IR and the in-memory form.
 997 1.  Layout maps
 998
 999     1.  Allow piece-wise affine maps for layouts: allows clean modeling of
1000         boundary cases for images/tensors through padding, wrapping, mirroring,
1001         padding where padded values are the results of computation as opposed to
1002         data, padding in the interior as opposed to just boundaries.
1003     1.  Allow many-to-one layout maps: Index and layout maps in the current
1004         proposal are bijective. Extending them to many-to-one layout maps allows
1005         cleaner(?) modeling of broadcast/reduce style computations while reusing
1006         memory.
1007
1008     Proposal 2(a) requires non-trivial changes to the IR and the in-memory
1009     representation. 2(b) requires no change, but impacts how cost models look at
1010     index and layout maps.
1011
1012 ### `affine.if` and `affine.for` Extensions for "Escaping Scalars"
1013
1014 We considered providing a representation for SSA values that are live out of
1015 `if/else` conditional bodies and loop carried in `affine.for` loops. We
1016 ultimately abandoned this approach due to its complexity. In the current design
1017 of MLIR, scalar variables cannot escape for loops or if instructions. In
1018 situations, where escaping is necessary, we use zero-dimensional tensors and
1019 memrefs instead of scalars.
1020
1021 **TODO**: This whole section is obsolete and should be updated to use block
1022 arguments and a yield like terminator in for/if instructions.
1023
1024 The abandoned design of supporting escaping scalars is as follows:
1025
1026 #### affine.for Instruction
1027
1028 Syntax:
1029
1030 ```
1031 [<out-var-list> =]
1032 for %<index-variable-name> = <lower-bound> ... <upper-bound> step <step>
1033    [with <in-var-list>] { <loop-instruction-list> }
1034 ```
1035
1036 out-var-list is a comma separated list of SSA values defined in the loop body
1037 and used outside the loop body. in-var-list is a comma separated list of SSA
1038 values used inside the loop body and their initializers. loop-instruction-list
1039 is a list of instructions that may also include a yield instruction.
1040
1041 Example:
1042
1043 ```mlir
1044 // Return sum of elements in 1-dimensional mref A
1045 func i32 @sum(%A : memref<?xi32>, %N : i32) -> (i32) {
1046    %init = 0
1047    %result = affine.for %i = 0 to N with %tmp(%init) {
1048       %value = affine.load %A[%i]
1049       %sum = %value + %tmp
1050       yield %sum
1051    }
1052    return %result : i32
1053 }
1054 ```
1055
1056 #### affine.if/else Instruction
1057
1058 Syntax:
1059
1060 ```
1061 <out-var-list> = affine.if (<cond-list>) {...} [else {...}]
1062 ```
1063
1064 Out-var-list is a list of SSA values defined by the if-instruction. The values
1065 are arguments to the yield-instruction that occurs in both then and else clauses
1066 when else clause is present. When if instruction contains only if clause, the
1067 escaping value defined in the then clause should be merged with the value the
1068 variable had before the if instruction. The design captured here does not handle
1069 this situation.
1070
1071 Example:
1072
1073 ```mlir
1074 // Compute sum of half of the array
1075 func i32 @sum_half(%A : memref<?xi32>, %N : i32) -> (i32) {
1076    %s0 = 0
1077    %s1 = affine.for %i = 1 ... N step 1 with %s2 (%s0) {
1078        %s3 = if (%i >= %N / 2) {
1079           %v0 = affine.load %A[%i]
1080           %s4 = %s2 + %v0
1081           yield %s4
1082        }
1083        yield %s3
1084    }
1085    return %s1 : i32
1086 }
1087 ```
1088
1089 ### Multithreading the compiler
1090
1091 People want compilers to go fast, and one simple way to do that is to
1092 multi-thread them. There are multiple strategies for this, but a simple one is
1093 to optimize and compile separate functions in parallel. LLVM's original pass
1094 manager anticipated this demand, and the CallGraphSCCPass manager is even
1095 designed to support this as well, but unfortunately, a few early design
1096 decisions in LLVM prevent this from ever happening. Instead, things like ThinLTO
1097 are forced to split programs into separate LLVM modules/context and optimize
1098 those chunks independently.
1099
1100 The problem is that LLVM has several objects in its IR that are globally uniqued
1101 and also mutable: notably constants like `i32 0`. In LLVM, these constants are
1102 `Value`'s, which allow them to be used as operands to instructions, and that
1103 they also have SSA use lists. Because these things are uniqued, every `i32 0` in
1104 any function shares a use list. This means that optimizing multiple functions in
1105 parallel won't work (at least without some sort of synchronization on the use
1106 lists, which would be unbearably inefficient).
1107
1108 MLIR now supports a multithreaded pass manager. We do this through several
1109 design choices:
1110
1111 1.  MLIR makes use of extensive uniqued immutable data structures (affine
1112     expressions, types, etc are all immutable, uniqued, and immortal).
1113 2.  Constants are defined in per-function pools, instead of being globally
1114     uniqued.
1115 3.  Functions themselves are not SSA values either, so they don't have the same
1116     problem as constants.
1117 4.  FunctionPasses are copied (through their copy ctor) into one instance per
1118     thread, avoiding sharing of local state across threads.
1119
1120 This allows MLIR function passes to support efficient multithreaded compilation
1121 and code generation.