mlir/docs/Rationale/Rationale.md

   1 # MLIR Rationale
   2
   3 This document is intended to capture some of the alternatives considered and
   4 open debates in the design of MLIR, along with the rationale for certain
   5 decisions we made. This is not intended to be a "finely groomed" document - we
   6 prefer the ability to dump in interesting tidbits without worrying too much
   7 about their consistency or readability.
   8
   9 [TOC]
  10
  11 ## Abstract
  12
  13 MLIR is a compiler intermediate representation with similarities to traditional
  14 three-address SSA representations (like
  15 [LLVM IR](http://llvm.org/docs/LangRef.html) or
  16 [SIL](https://github.com/apple/swift/blob/main/docs/SIL.rst)), but which
  17 introduces notions from the polyhedral loop optimization works as first class
  18 concepts. This hybrid design is optimized to represent, analyze, and transform
  19 high level dataflow graphs as well as target-specific code generated for high
  20 performance data parallel systems. Beyond its representational capabilities, its
  21 single continuous design provides a framework to lower from dataflow graphs to
  22 high performance target specific code.
  23
  24 MLIR stands for one of "Multi-Level IR" or "Multi-dimensional Loop IR" or
  25 "Machine Learning IR" or "Mid Level IR", we prefer the first. This document only
  26 provides the rationale behind MLIR -- its actual
  27 [specification document](../LangRef.md) and other content is hosted elsewhere.
  28
  29 ## Introduction and Motivation
  30
  31 The Multi-Level Intermediate Representation (MLIR) is intended for easy
  32 expression and optimization of computations involving deep loop nests and dense
  33 matrices of high dimensionality. It is thus well-suited to deep learning
  34 computations in particular. Yet it is general enough to also represent arbitrary
  35 sequential computation. The representation allows high-level optimization and
  36 parallelization for a wide range of parallel architectures including those with
  37 deep memory hierarchies --- general-purpose multicores, GPUs, and specialized
  38 neural network accelerators.
  39
  40 MLIR uses ideas drawn from IRs of LLVM and Swift for lower level constructs
  41 while combining them with ideas from the polyhedral abstraction to represent
  42 loop nests, multidimensional data (tensors), and transformations on these
  43 entities as first class concepts in the IR.
  44
  45 MLIR is a multi-level IR, i.e., it represents code at a domain-specific
  46 representation such as HLO or TensorFlow graphs, all the way down to the machine
  47 level. MLIR is able to represent arbitrary control flow and arbitrary data
  48 accesses, and is general enough to represent nearly all sequential computation.
  49 This is a key distinction from existing polyhedral representation
  50 implementations (such as LLVM [Polly](https://polly.llvm.org/)) that are able to
  51 use the polyhedral abstraction in a way isolated from the LLVM IR and only for
  52 affine loop nests, i.e., portions of the code where array accesses, loop bounds,
  53 and conditionals are regular (involve linear functions of loop iterators and
  54 constant symbols). The presence of statically unpredictable data accesses or
  55 control flow does not preclude representation in MLIR, but only limits to a
  56 certain extent the ability to reason about and apply transformations using the
  57 polyhedral abstraction.
  58
  59 Maps, sets, and relations with affine constraints are the core structures
  60 underlying a polyhedral representation of high-dimensional loop nests and
  61 multidimensional arrays. These structures are represented as textual expressions
  62 in a form close to their mathematical form. These structures are used to capture
  63 loop nests, tensor data structures, and how they are reordered and mapped for a
  64 target architecture. All structured or "conforming" loops are captured as part
  65 of the polyhedral information, and so are tensor variables, their layouts, and
  66 subscripted accesses to these tensors in memory.
  67
  68 The information captured in the IR allows a compact expression of all loop
  69 transformations, data remappings, explicit copying necessary for explicitly
  70 addressed memory in accelerators, mapping to pre-tuned expert-written
  71 primitives, and mapping to specialized vector instructions. Loop transformations
  72 that can be easily implemented include the body of affine transformations: these
  73 subsume all traditional loop transformations (unimodular and non-unimodular)
  74 such as loop tiling, interchange, permutation, skewing, scaling, relative
  75 shifting, reversal, fusion, and distribution/fission. Transformations on data
  76 layout such as padding and transforming to blocked layouts are also represented
  77 well via affine layout maps.
  78
  79 MLIR's design allows a progressive lowering to target-specific forms. Besides
  80 high-level transformations for loop nests and data layouts that a typical
  81 mid-level optimizer is expected to deal with, MLIR is also designed to perform
  82 certain low-level scheduling and mapping decisions that a typical backend IR is
  83 entrusted with: these include mapping to specialized vector instructions,
  84 auto-vectorization, and software pipelining. The need to support these
  85 transformations stems from the fact that neural network accelerators have
  86 specialized units that deal with large chunks of data whose computation maps
  87 back to chunks of more than one loop of the loop nests as viewed by a program at
  88 a level closer to the original specification. Such specialized units or
  89 instructions operate on multidimensional data chunks from a programmer's
  90 viewpoint. It thus makes it hard or infeasible for a backend operating on a very
  91 low-level IR close to assembly to lift and reconstruct loops and perform such a
  92 mapping. This is in contrast to classic instruction selection and scheduling in
  93 today's compilers that primarily only deals with the body of the innermost loop.
  94 MLIR also facilitates automatic mapping to expert pre-tuned primitives or vendor
  95 libraries operating on data at higher levels (or at the highest level) of the
  96 memory hierarchy.
  97
  98 In summary, MLIR is convenient for and closed under the kind of transformations
  99 needed to lower to general-purpose as well as specialized accelerators. It also
 100 allows one to build modular and reusable target independent and target dependent
 101 passes.
 102
 103 ## Design Decisions
 104
 105 This section sheds light on some of the design decisions -- some of these are
 106 indirectly implied by the specification document.
 107
 108 ### Loads and stores
 109
 110 The 'load' and 'store' instructions are specifically crafted to fully resolve to
 111 an element of a memref. These instructions take as arguments n+1 indices for an
 112 n-ranked tensor. This disallows the equivalent of pointer arithmetic or the
 113 ability to index into the same memref in other ways (something which C arrays
 114 allow for example). Furthermore, for the affine constructs, the compiler can
 115 follow use-def chains (e.g. through
 116 [affine.apply operations](../Dialects/Affine.md/#affineapply-affineapplyop) or
 117 through the map attributes of
 118 [affine operations](../Dialects/Affine.md/#operations)) to precisely analyze
 119 references at compile-time using polyhedral techniques. This is possible because
 120 of the
 121 [restrictions on dimensions and symbols](../Dialects/Affine.md/#restrictions-on-dimensions-and-symbols).
 122
 123 A scalar of element-type (a primitive type or a vector type) that is stored in
 124 memory is modeled as a 0-d memref. This is also necessary for scalars that are
 125 live out of for loops and if conditionals in a function, for which we don't yet
 126 have an SSA representation --
 127 [an extension](#affineif-and-affinefor-extensions-for-escaping-scalars) to allow
 128 that is described later in this doc.
 129
 130 ### Symbols and types
 131
 132 The current MLIR disallows use of symbols in types. For example, when a tensor
 133 or memref dimension is statically unknown, it is denoted in the type as '?'. An
 134 SSA symbol is then bound to it when a memref is created. The actual value of the
 135 unknown dimension can be queried using the "dim" builtin as shown below.
 136
 137 Example:
 138
 139 ```mlir
 140 func.func foo(...) {
 141   %A = memref.alloc <8x?xf32, #lmap> (%N)
 142   ...
 143   call bar(%A) : (memref<8x?xf32, #lmap>)
 144 }
 145
 146 func.func bar(%A : memref<8x?xf32, #lmap>) {
 147   // Type of %A indicates that %A has dynamic shape with 8 rows
 148   // and unknown number of columns. The number of columns is queried
 149   // dynamically using dim instruction.
 150   %N = memref.dim %A, 1 : memref<8x?xf32, #lmap>
 151
 152   affine.for %i = 0 to 8 {
 153     affine.for %j = 0 to %N {
 154       // A[i,j] += 1
 155       %s1 = affine.load %A[%i, %j] : memref<8x?xf32, #lmap>
 156       %s2 = add %s1, 1
 157       affine.store %s2, %A[%i, %j] : memref<8x?xf32, #lmap>
 158     }
 159   }
 160   return
 161 }
 162
 163 ```
 164
 165 An alternative design is to embed the reference to symbols directly in the
 166 type - memref<8x%Nxf32>. We went for the current approach in MLIR because it
 167 simplifies the design --- types remain immutable when the values of symbols
 168 change.
 169
 170 ### Block Arguments vs PHI nodes
 171
 172 MLIR Regions represent SSA using "[block arguments](../LangRef.md/#blocks)"
 173 rather than [PHI instructions](http://llvm.org/docs/LangRef.html#i-phi) used in
 174 LLVM. This choice is representationally identical (the same constructs can be
 175 represented in either form) but block arguments have several advantages:
 176
 177 1.  LLVM PHI nodes always have to be kept at the top of a block, and
 178     transformations frequently have to manually skip over them. This is defined
 179     away with BB arguments.
 180 1.  LLVM has a separate function Argument node. This is defined away with BB
 181     arguments, because the arguments to the entry block serve this purpose.
 182 1.  Blocks of PHI nodes in LLVM execute atomically, which is surprising and
 183     super confusing to compiler engineers and it is easy to introduce bugs with
 184     this (very related to the
 185     "[lost copy](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.524.5461&rep=rep1&type=pdf)"
 186     problem in SSA lowering literature.) With the BB argument representation,
 187     this confusion is defined away.
 188 1.  The entry list of PHI nodes in LLVM are unordered, and some blocks have
 189     thousands of predecessors (e.g. unwind blocks). This can cause long compile
 190     time problems because transformations have to linearly scan this list. This
 191     is defined away with BB argument representation.
 192 1.  LLVM has no way to represent values that are available only in one successor
 193     but not the other, e.g. its invoke instruction cannot produce the exception
 194     value JUST on the exception edge. Instead, the
 195     [landingpad instruction](http://llvm.org/docs/LangRef.html#landingpad-instruction)
 196     is a hack used to represent this. MLIR doesn't make use of this capability,
 197     but SIL uses it extensively, e.g. in the
 198     [switch_enum instruction](https://github.com/apple/swift/blob/main/docs/SIL.rst#switch-enum).
 199
 200 For more context, block arguments were previously used in the Swift
 201 [SIL Intermediate Representation](https://github.com/apple/swift/blob/main/docs/SIL.rst),
 202 and described in
 203 [a talk on YouTube](https://www.youtube.com/watch?v=Ntj8ab-5cvE). The section of
 204 interest
 205 [starts here](https://www.youtube.com/watch?v=Ntj8ab-5cvE&t=596s).
 206
 207 ### Index type usage and limitations
 208
 209 Index types are intended to be used for platform-specific "size" values and may
 210 appear in subscripts, sizes of aggregate types and affine expressions. They are
 211 also tightly coupled with `affine.apply` and affine.load/store operations;
 212 having `index` type is a necessary precondition of a value to be acceptable by
 213 these operations.
 214
 215 We allow `index` types in tensors, vectors, and memrefs as a code generation
 216 strategy has to map `index` to an implementation type and hence needs to be able
 217 to materialize corresponding values. However, the target might lack support for
 218 `vector` values with the target specific equivalent of the `index` type.
 219
 220 ### Data layout of non-primitive types
 221
 222 Data layout information such as the bit width or the alignment of types may be
 223 target and ABI-specific and thus should be configurable rather than imposed by
 224 the compiler. Especially, the layout of compound or `index` types may vary. MLIR
 225 specifies default bit widths for certain primitive *types*, in particular for
 226 integers and floats. It is equal to the number that appears in the type
 227 definition, e.g. the bit width of `i32` is `32`, so is the bit width of `f32`.
 228 The bit width is not *necessarily* related to the amount of memory (in bytes) or
 229 the register size (in bits) that is necessary to store the value of the given
 230 type. For example, `vector<3xi57>` is likely to be lowered to a vector of four
 231 64-bit integers, so that its storage requirement is `4 x 64 / 8 = 32` bytes,
 232 rather than `(3 x 57) ceildiv 8 = 22` bytes as can be naively computed from the
 233 bit width. MLIR makes such [data layout information](../DataLayout.md)
 234 configurable using attributes that can be queried during lowering, for example,
 235 when allocating a compound type.
 236
 237 The data layout of dialect-specific types is undefined at MLIR level. Yet
 238 dialects are free to define their own quantities and make them available via the
 239 data layout infrastructure.
 240
 241 ### Integer signedness semantics
 242
 243 Integers in the builtin MLIR type system have a bitwidth (note that the `index`
 244 type has a symbolic width equal to the machine word size), and they *may*
 245 additionally have signedness semantics. The purpose is to satisfy the needs of
 246 different dialects, which can model different levels of abstractions. Certain
 247 abstraction, especially closer to source language, might want to differentiate
 248 signedness with integer types; while others, especially closer to machine
 249 instruction, might want signless integers. Instead of forcing each abstraction
 250 to adopt the same integer modelling or develop its own one in house, Integer
 251 type provides this as an option to help code reuse and consistency.
 252
 253 For the standard dialect, the choice is to have signless integer types. An
 254 integer value does not have an intrinsic sign, and it's up to the specific op
 255 for interpretation. For example, ops like `arith.addi` and `arith.muli` do two's
 256 complement arithmetic, but some other operations get a sign, e.g. `arith.divsi`
 257 vs `arith.divui`.
 258
 259 LLVM uses the [same design](http://llvm.org/docs/LangRef.html#integer-type),
 260 which was introduced in a revamp rolled out
 261 [in the LLVM 2.0 integer type](http://releases.llvm.org/2.0/docs/LangRef.html#t_derived).
 262 Prior to that, from
 263 [LLVM 1.0](http://releases.llvm.org/1.0/docs/LangRef.html#t_classifications) to
 264 [1.9](http://releases.llvm.org/1.9/docs/LangRef.html#t_classifications), LLVM
 265 uses signed types like "sbyte" and "ubyte". This shift was important and has
 266 served LLVM well over the years. The reason this is important is that it is a
 267 good thing for an intermediate representation to represent the same computation
 268 with the same instruction. Signed types got in the way, because (e.g.) an "add
 269 of an sbyte" does the same computation as an "add of a ubyte", but the type
 270 system made them look artificially different. This split also required casts
 271 like "cast from sbyte to ubyte" which do nothing at the machine level. Removing
 272 signs from the type system eliminated these problems, making the compiler
 273 simpler.
 274
 275 More information about this split is available in an old
 276 [talk on youtube](https://www.youtube.com/watch?v=VeRaLPupGks) talking about
 277 LLVM 2.0.
 278
 279 Note that this rationale only applies to the "standard ops" dialect in which we
 280 can express an opinion about its design. Other dialects generally try to model
 281 an external system, and should aim to reflect its design as closely as possible.
 282
 283 ### Splitting floating point vs integer operations
 284
 285 The MLIR "Arith" dialect splits many integer and floating point operations
 286 into different categories, for example `arith.addf` vs `arith.addi` and
 287 `arith.cmpf` vs `arith.cmpi`
 288 ([following the design of LLVM](http://llvm.org/docs/LangRef.html#binary-operations)).
 289 These instructions *are* polymorphic on the number of elements in the type
 290 though, for example `addf` is used with scalar floats, vectors of floats, and
 291 tensors of floats (LLVM does the same thing with its scalar/vector types).
 292
 293 This split is important because floating point and integer operations are quite
 294 different in practice: for example, floating point values include NaN's, so
 295 [integer comparisons](http://llvm.org/docs/LangRef.html#icmp-instruction) and
 296 [floating point comparisons](http://llvm.org/docs/LangRef.html#fcmp-instruction)
 297 should use different comparison opcodes. On the arithmetic side of things,
 298 floating point operations support rounding modes, floating point contractions,
 299 ["fast math"](http://llvm.org/docs/LangRef.html#fadd-instruction), and integers
 300 may want to have two's complement overflow behavior or be undefined on
 301 [various forms of wrapping](http://llvm.org/docs/LangRef.html#add-instruction)
 302 for performance.
 303
 304 We are a long way from this sort of thing being a priority to care about in
 305 MLIR, but since we have experience and know the right way to do this, we'd
 306 rather design it in from the beginning.
 307
 308 Note that this rationale only applies to the "standard ops" dialect in which we
 309 can express an opinion about its design. Other dialects generally try to model
 310 an external system, and should aim to reflect its design as closely as possible.
 311
 312 ### Specifying sign in integer comparison operations
 313
 314 Since integers are [signless](#integer-signedness-semantics), it is necessary to
 315 define the sign for integer comparison operations. This sign indicates how to
 316 treat the foremost bit of the integer: as sign bit or as most significant bit.
 317 For example, comparing two `i4` values `0b1000` and `0b0010` yields different
 318 results for unsigned (`8 > 3`) and signed (`-8 < 3`) interpretations. This
 319 difference is only significant for *order* comparisons, but not for *equality*
 320 comparisons. Indeed, for the latter all bits must have the same value
 321 independently of the sign. Since both arguments have exactly the same bit width
 322 and cannot be padded by this operation, it is impossible to compare two values
 323 whose bit representations would differ while the values are interpreted as
 324 equal.
 325
 326 ### Specifying comparison kind as attribute
 327
 328 Unlike arithmetic, comparison operators share several common properties, e.g.
 329 they cannot be considered associative. In practice, comparisons are sometimes
 330 implemented by the same instruction or its variants so it makes sense to group
 331 them together at the IR level.
 332
 333 An alternative would be introducing ten distinct operators for all currently
 334 supported kinds of integer comparisons. These operators would have increased the
 335 number of "reserved" names used by standard operations as well as the size of
 336 the C++ API while their implementations would have been mostly identical.
 337
 338 The comparison kind is internally an integer attribute. However, for the sake of
 339 readability by humans, custom assembly form accepts string literals that are
 340 mapped to the underlying integer values: `cmpi "eq", %lhs, %rhs` better implies
 341 integer equality comparison than `cmpi 0, %lhs, %rhs` where it is unclear what
 342 gets compared to what else. This syntactic sugar is possible thanks to parser
 343 logic redefinitions for custom assembly form of non-builtin operations.
 344 Supporting it in the full notation would have required changing how the main
 345 parsing algorithm works and may have unexpected repercussions. While it had been
 346 possible to store the predicate as string attribute, it would have rendered
 347 impossible to implement switching logic based on the comparison kind and made
 348 attribute validity checks (one out of ten possible kinds) more complex.
 349
 350 ### Regions
 351
 352 #### Attributes of type 'Block'
 353
 354 We considered representing regions through `ArrayAttr`s containing a list of a
 355 special type `IRBlockAttr`, which in turn would contain a list of operations.
 356 All attributes in MLIR are unique’d within the context, which would make the IR
 357 inside the regions immortal for no good reason.
 358
 359 #### Use "inlined" functions as regions
 360
 361 We considered attaching a "force-inline" attribute on a function and/or a
 362 function `call` operation. Even the minimal region support (use cases in
 363 affine.for and affine.if existing before the regions) requires access to the
 364 values defined in the dominating block, which is not supported by functions.
 365 Conceptually, function bodies are instances of regions rather than the inverse;
 366 regions can also be device kernels, alternative sections, etc.
 367
 368 #### Dedicated `region` operation
 369
 370 This would mean we have a special kind of operation that is allowed to have
 371 regions while other operations are not. Such distinction is similar to the
 372 Stmt/Op difference we have had and chose to remove to make the IR simpler and
 373 more flexible. It would also require analyses and passes to consider the
 374 interplay between operations (e.g., an `affine.for` operation must be followed
 375 by a region operation). Finally, a region operation can be introduced using the
 376 current implementation, among other operations and without being special in any
 377 sense.
 378
 379 #### Explicit capture of the values used in a region
 380
 381 Being able to use values defined outside the region implies that use-def chains
 382 may contain uses from different nested regions. Consequently, IR transformations
 383 and analyses can pull the instruction defining the value across region
 384 boundaries, for example in case of TableGen-defined canonicalization patterns.
 385 This would not be the case if all used values had been passed as region
 386 arguments. One of the motivations for introducing regions in the IR is precisely
 387 to enable cross-region analyses and transformations that are simpler than
 388 inter-procedural transformations. Having uses from different regions appear in
 389 the same use-def chain, contrary to an additional data structure maintaining
 390 correspondence between function call arguments as uses of the original
 391 definitions and formal arguments as new definitions, enables such
 392 simplification. Since individual operations now belong to blocks, which belong
 393 to regions, it is always possible to check if the definition of the value
 394 belongs to the same region as its particular use. The risk is that any IR
 395 traversal will need to handle explicitly this situation and it is easy to forget
 396 a check (or conversely it isn’t easy to design the right check in a tablegen
 397 pattern for example): traversing use-def chains potentially crosses implicitly
 398 semantic barriers, making it possible to unknowingly break region semantics.
 399 This is expected to be caught in the verifier after the transformation.
 400
 401 At the same time, one may choose to pass certain or all values as region
 402 arguments to explicitly break the use-def chains in the current proposal. This
 403 can be combined with an attribute-imposed semantic requirement disallowing the
 404 body of the region to refer to any value from outside it.
 405
 406 ### Dialect type extensions
 407
 408 This section describes the design decisions that shaped the dialect extensible
 409 type system present in MLIR.
 410
 411 #### Interactions between dialects
 412
 413 There are two different interactions between dialects that are important to
 414 understand. When types of a dialect are:
 415
 416 *   In operations of other dialects
 417
 418     -   For standard/builtin operations, only builtin types are allowed. This
 419         restriction allows for operations to clearly understand the invariants
 420         that they are working under.
 421     -   Outside of standard/builtin operations, dialects are expected to verify
 422         the allowable operation types per operation.
 423
 424 *   In types of other dialects
 425
 426     -   For builtin types, these types are allowed to contain types from other
 427         dialects. This simplifies the type system and removes the need for
 428         dialects to redefine all of the builtin aggregate types, e.g. tensor, as
 429         well as the memref type. Dialects are expected to verify that a specific
 430         type is valid within a builtin type, e.g. if a type can be an element of
 431         a tensor.
 432     -   For dialect types, the dialect is expected to verify any type
 433         invariants, e.g. if the tensor type can contain a specific type of that
 434         dialect.
 435
 436 #### Separating builtin and standard types
 437
 438 Following the separation between the built-in and standard dialect, it makes
 439 sense to separate built-in types and standard dialect types. Built-in types are
 440 required for the validity of the IR itself, e.g. the function type (which
 441 appears in function signatures and generic assembly forms of operations).
 442 Integer, float, vector, memref and tensor types, while important, are not
 443 necessary for IR validity.
 444
 445 #### Unregistered types
 446
 447 MLIR supports unregistered operations in generic assembly form. MLIR also
 448 supports a similar concept for types. When parsing, if the dialect for dialect
 449 type has not been registered the type is modeled as an 'OpaqueType'. This allows
 450 for types to be round-tripped without needing to link in the dialect library
 451 that defined them. No additional information about opaque types, outside of
 452 parsing/printing, will be available.
 453
 454 #### Dialect type syntax
 455
 456 Dialect extended types are represented as string literals wrapped inside of the
 457 dialect namespace. This means that the parser delegates to the dialect for
 458 parsing specific type instances. This differs from the representation of dialect
 459 defined operations, of which have an identifier name that the parser uses to
 460 identify and parse them.
 461
 462 This representation was chosen for several reasons:
 463
 464 ##### Dialects must provide custom type parsers
 465
 466 Dialect type parsing cannot plug into the existing parser infrastructure as
 467 operations do with the OpAsmParser/Printer. Operations have a defined syntax
 468 structure that is the same across all dialects. Types, on the other hand, may
 469 have many different, and sometimes conflicting, parsing constraints that would
 470 be difficult/unmaintainable to provide within a single interface.
 471
 472 This also has the added benefit of encouraging dialects to reuse existing
 473 external type parsers. For example, an LLVM dialect may provide an MLIR LLVM
 474 type that is simply a wrapper around LLVM types. The LLVM dialect would then use
 475 the existing LLVM type parsing infrastructure.
 476
 477 Example:
 478
 479 ```mlir
 480 %s = "foo"() : () -> !llvm<"i32*">
 481 ```
 482
 483 ##### Types do not always have canonical names
 484
 485 Unlike operations, types generally do not have a formal canonical name. For
 486 example, function types have no defined keyword and integer types are defined by
 487 a regular expression to support arbitrary bitwidth. Dialects with existing type
 488 systems, e.g. LLVM, are likely to provide wrappers around their existing type
 489 systems. For these wrapper types there is no simple canonical name, it's logical
 490 to think of these types as existing within the namespace of the dialect. If a
 491 dialect wishes to assign a canonical name to a type, it can be done via
 492 [type aliases](../LangRef.md/#type-aliases).
 493
 494 ### Tuple types
 495
 496 The MLIR type system provides first class support for defining
 497 [tuple types](../Dialects/Builtin/#tupletype). This is due to the fact that
 498 `Tuple` represents a universal concept that is likely to, and has already begun
 499 to, present itself in many different dialects. Though this type is first class
 500 in the type system, it merely serves to provide a common mechanism in which to
 501 represent this concept in MLIR. As such, MLIR provides no standard operations
 502 for interfacing with `tuple` types. It is up to dialect authors to provide
 503 operations, e.g. extract_tuple_element, to interpret and manipulate them. When
 504 possible, operations should prefer to use multiple results instead. These
 505 provide a myriad of benefits, such as alleviating any need for tuple-extract
 506 operations that merely get in the way of analysis and transformation.
 507
 508 ### Assembly forms
 509
 510 MLIR decides to support both generic and custom assembly forms under the
 511 following considerations:
 512
 513 MLIR is an open system; it is designed to support modular and pluggable
 514 dialects. Depending on whether there exists a corresponding dialect and whether
 515 the dialect is plugged in, operations may or may not be registered into MLIR
 516 system. Yet we still need a way to investigate these operations. So the generic
 517 assembly form is mandated by this aspect of MLIR system. It provides a default
 518 textual form for operations.
 519
 520 On the other hand, an assembly form is for assisting developers to investigate
 521 the IR. The generic form serves as a safe fallback but it can be too verbose for
 522 certain ops. Therefore, MLIR gives each dialect the choice to define a custom
 523 assembly form for each operation according to the operation's semantics and
 524 specific needs. The custom assembly form can de-duplicate information from the
 525 operation to derive a more concise form, thus better facilitating the
 526 comprehension of the IR.
 527
 528 ## Examples
 529
 530 This section describes a few very simple examples that help understand how MLIR
 531 represents computation.
 532
 533 ### Non-affine control flow
 534
 535 ```mlir
 536 // A simple linear search in every row of a matrix
 537 for (i = 0; i < N; i++) {
 538   for (j = 0; j < N; j++) {
 539     // dynamic control flow
 540     if (a[i][j] == key) {
 541       s[i] = j;
 542       break;
 543     }
 544   }
 545 }
 546 ```
 547
 548 The presence of dynamic control flow leads to an inner non-affine function
 549 nested in an outer function that uses affine loops.
 550
 551 ```mlir
 552 func.func @search(%A: memref<?x?xi32>, %S: <?xi32>, %key : i32) {
 553   %ni = memref.dim %A, 0 : memref<?x?xi32>
 554   // This loop can be parallelized
 555   affine.for %i = 0 to %ni {
 556     call @search_body (%A, %S, %key, %i) : (memref<?x?xi32>, memref<?xi32>, i32, i32)
 557   }
 558   return
 559 }
 560
 561 func.func @search_body(%A: memref<?x?xi32>, %S: memref<?xi32>, %key: i32, %i : i32) {
 562   %nj = memref.dim %A, 1 : memref<?x?xi32>
 563   cf.br ^bb1(0)
 564
 565 ^bb1(%j: i32)
 566   %p1 = arith.cmpi "lt", %j, %nj : i32
 567   cf.cond_br %p1, ^bb2, ^bb5
 568
 569 ^bb2:
 570   %v = affine.load %A[%i, %j] : memref<?x?xi32>
 571   %p2 = arith.cmpi "eq", %v, %key : i32
 572   cf.cond_br %p2, ^bb3(%j), ^bb4
 573
 574 ^bb3(%j: i32)
 575   affine.store %j, %S[%i] : memref<?xi32>
 576   cf.br ^bb5
 577
 578 ^bb4:
 579   %jinc = arith.addi %j, 1 : i32
 580   cf.br ^bb1(%jinc)
 581
 582 ^bb5:
 583   return
 584 }
 585 ```
 586
 587 As per the [MLIR spec](../LangRef.md), the restrictions on dimensions and symbol
 588 identifiers to be used with the affine.apply operation only apply to accesses
 589 inside `affine.for` and `affine.if` operations. However, an analysis of accesses
 590 inside the called function (`@search_body`) is necessary to determine if the
 591 `%i` loop could be parallelized: such function access analysis is calling
 592 context sensitive.
 593
 594 ### Non-affine loop bounds
 595
 596 Loop bounds that are not affine lead to a nesting of functions as shown below.
 597
 598 ```c
 599 for (i = 0; i < N; i++)
 600   for (j = 0; j < N; j++)
 601     // Non-affine loop bound for k loop.
 602     for (k = 0; k < pow(2, j); k++)
 603        for (l = 0; l < N; l++) {
 604         // block loop body
 605         ...
 606        }
 607 ```
 608
 609 ```mlir
 610 func.func @outer_nest(%n : index) {
 611   affine.for %i = 0 to %n {
 612     affine.for %j = 0 to %n {
 613       %pow = call @pow(2, %j) : (index, index) ->  index
 614       call @inner_nest(%pow, %n) : ...
 615     }
 616   }
 617   return
 618 }
 619
 620 func.func @inner_nest(%m : index, %n : index) {
 621   affine.for %k = 0 to %m {
 622     affine.for %l = 0 to %n {
 623       ...
 624     }
 625   }
 626   return
 627 }
 628 ```
 629
 630 ### Reference 2D Convolution
 631
 632 The following example illustrates a reference implementation of a 2D
 633 convolution, which uses an integer set `#domain` to represent valid input data
 634 in a dilated convolution.
 635
 636 ```mlir
 637 // Dilation factors S0 and S1 can be constant folded if constant at compile time.
 638 #domain = (d0, d1)[S0,S1,S2,S3]: (d0 % S0 == 0, d1 % S1 == 0, d0 >= 0, d1 >= 0,
 639                                    S3 - d0 - 1 >= 0, S4 - d1 - 1 >= 0)
 640 // Identity map (shown here for illustration).
 641 #map0 = (d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2, d3, d4, d5, d6)
 642
 643 // Affine map from output to input coordinate space.
 644 // d0 = output_h, d1 = output_w, d2 = kernel_h, d3 = kernel_w
 645 // S0 = h_stride, S1 = w_stride, S2 = h_kernel_dilation, S3 = w_kernel_dilation
 646 // S4 = h_pad_low, S5 = w_pad_low
 647 //     %out0 =  %0#1 * %h_stride + %0#4 * %h_kernel_dilation - %h_pad_low
 648 //     %out1=  %0#2 * %w_stride + %0#5 * %w_kernel_dilation - %w_pad_low
 649 #map1_0 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d0 * S0 + d2 * S2 - %S4)
 650 #map1_1 = (d0, d1, d2, d3) [S0, S1, S2, S3, S4, S5] -> (d1 * S1 + d3 * S3 - %S5)
 651
 652 // Semi-affine map to undilated input coordinate space.
 653 // d0 = input_h, d1 = input_w, S0 = h_base_dilation, S1 = w_base_dilation.
 654 #map2_0 = (d0, d1) [S0, S1] -> (d0 / S0)
 655 #map2_1 = (d0, d1) [S0, S1] -> (d1 / S1)
 656
 657 // Conv2D shapes:
 658 // input:   [batch, input_height, input_width, input_feature]
 659 // kernel: [kernel_height, kernel_width, input_feature, output_feature]
 660 // output: [batch, output_height, output_width, output_feature]
 661 func.func @conv2d(%input: memref<16x1024x1024x3xf32, #lm0, /*scratchpad=*/1>,
 662              %kernel: memref<5x5x3x32xf32, #lm0, /*scratchpad=*/1>,
 663              %output: memref<16x512x512x32xf32, #lm0, /*scratchpad=*/1>) {
 664   affine.for %b = 0 to %batch {
 665     affine.for %oh = 0 to %output_height {
 666       affine.for %ow = 0 to %output_width {
 667         affine.for %of = 0 to %output_feature {
 668           affine.for %kh = 0 to %kernel_height {
 669             affine.for %kw = 0 to %kernel_width {
 670               affine.for %if = 0 to %input_feature {
 671                 // Calculate input indices.
 672                 %1_0 = affine.apply #map1_0 (%0#1, %0#2, %0#4, %0#5)
 673                   [%h_stride, %w_stride, %h_kernel_dilation, %w_kernel_dilation,
 674                    %h_pad_low, %w_pad_low]
 675                 %1_1 = affine.apply #map1_1 (%0#1, %0#2, %0#4, %0#5)
 676                   [%h_stride, %w_stride, %h_kernel_dilation, %w_kernel_dilation,
 677                    %h_pad_low, %w_pad_low]
 678
 679                 // Check if access is not in padding.
 680                 affine.if #domain(%1_0, %1_1)
 681                                        [%h_base_dilation, %w_kernel_dilation, %h_bound, %w_bound] {
 682                   %2_0 = affine.apply #map2 (%1_0, %1_1)
 683                   %2_1 = affine.apply #map2 (%1_0, %1_1)
 684                   // Compute: output[output_indices] += input[input_indices] * kernel[kernel_indices]
 685                   call @multiply_accumulate(%input, %kernel, %output, %b, %oh, %ow, %of, %kh, %kw, %if, %2_0, %2_1)
 686                 }
 687               }
 688             }
 689           }
 690         }
 691       }
 692     }
 693   }
 694   return
 695 }
 696 ```
 697
 698 TODO: (Add more examples showing the IR for a variety of interesting cases)
 699
 700 ## Design alternatives and extensions
 701
 702 This is a list of some design alternatives and extensions that we discussed in
 703 detail but did not include in the spec or postponed them for future
 704 consideration on demand. We will revisit these discussions when we have more
 705 implementation experience and learn more about the challenges and limitations of
 706 our current design in practice.
 707
 708 ### Polyhedral code representation alternatives: schedule lists vs schedules trees vs affine loop/if forms
 709
 710 The current MLIR uses a representation of polyhedral schedules using a tree of
 711 if/for loops. We extensively debated the tradeoffs involved in the typical
 712 unordered polyhedral instruction representation (where each instruction has
 713 multidimensional schedule information), discussed the benefits of schedule tree
 714 forms, and eventually decided to go with a syntactic tree of affine if/else
 715 conditionals and affine for loops. Discussion of the tradeoff was captured in
 716 this document:
 717 [ MLIR: The case for a simplified polyhedral form](RationaleSimplifiedPolyhedralForm.md).
 718
 719 At a high level, we have two alternatives here:
 720
 721 1.  Schedule tree representation instead of an affine loop AST form: The current
 722     proposal uses an affine loop and conditional tree form, which is syntactic
 723     and with no separation of domains as sets and schedules as multidimensional
 724     affine functions. A schedule tree form however makes polyhedral domains and
 725     schedules a first class concept in the IR allowing compact expression of
 726     transformations through the schedule tree without changing the domains of
 727     instructions. Such a representation also hides prologues, epilogues, partial
 728     tiles, complex loop bounds and conditionals making loop nests free of
 729     "syntax". Cost models instead look at domains and schedules. In addition, if
 730     necessary such a domain schedule representation can be normalized to
 731     explicitly propagate the schedule into domains and model all the cleanup
 732     code. An example and more detail on the schedule tree form is in the next
 733     section.
 734 1.  Having two different forms of "affine regions": an affine loop tree form and
 735     a polyhedral schedule tree form. In the latter, ops could carry attributes
 736     capturing domain, scheduling, and other polyhedral code generation options
 737     with IntegerSet, AffineMap, and other attributes.
 738
 739 #### Schedule Tree Representation for Affine Regions
 740
 741 This representation is based on a simplified form of the domain/schedule
 742 representation used by the polyhedral compiler community. Domains represent what
 743 has to be executed while schedules represent the order in which domain elements
 744 are interleaved. We model domains as non-piece-wise convex integer sets, and
 745 schedules as affine functions; however, the former can be disjunctive, and the
 746 latter can be piece-wise affine relations. In the schedule tree representation,
 747 domain and schedules for instructions are represented in a tree-like structure
 748 which is called a schedule tree. Each non-leaf node of the tree is an abstract
 749 polyhedral dimension corresponding to an abstract fused loop for each ML
 750 instruction that appears in that branch. Each leaf node is an ML Instruction.
 751
 752 ```mlir
 753 // A tiled matmul code (128x128x128) represented in schedule tree form
 754
 755 // #map0 = (d0, d1, d2, d3, d4, d5) -> (128*d0 + d3, 128*d1 + d4, 128*d2 + d5)
 756 #intset_ij = (i, j) [M, N, K]  : i >= 0, -i + N - 1 >= 0, j >= 0, -j + N-1 >= 0
 757 #intset_ijk = (i, j, k) [M, N, K] : i >= 0, -i + N - 1 >= 0, j >= 0,
 758                                      -j +  M-1 >= 0, k >= 0, -k + N - 1 >= 0)
 759 func.func @matmul(%A, %B, %C, %M, %N, %K) : (...)  { // %M, N, K are symbols
 760   // t1, t2, t3, t4, t5, t6  are abstract polyhedral loops
 761   mldim %t1 : {S1,S2,S3,S4,S5}  floordiv (i, 128) {
 762     mldim %t2 : {S1,S2,S3,S4,S5}  floordiv (j, 128) {
 763       // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2)
 764       call dma_mem_to_scratchpad(%C, %i, %j, %M, %N, %K)
 765           with @intset_ij(%i, %j) [%M, %N, %K]
 766       mldim %t3 :   {S2,S3,S4,S5} floordiv (k, 128) {
 767         // (%i, %j, %k) = affine.apply (d0, d1, d2)
 768         //                          -> (128*d0, 128*d1, 128*d2)  (%t1, %t2, %t3)
 769         call dma_mem_to_scratchpad(%A, ...) with #inset_ijk (%i, %j, %k) [%M, %N, %K]
 770         // (%i, %j, %k) = affine.apply (d0, d1, d2)
 771         //                          -> (128*d0, 128*d1, 128*d2)  (%t1, %t2, %t3)
 772         call dma_mem_to_scratchpad(%B, ...) with #inset_ijk (%i, %j, %k) [%M, %N, %K]
 773         mldim %t4 : {S4} i mod 128 {
 774           mldim %t5 : {S4} j mod 128 {
 775             mldim %t6 : {S4} k mod 128 {
 776               // (%i, %j, %k) = affine.apply #map0 (%t1, %t2, %t3, %t4, %t5, %t6)
 777               call matmul_body(A, B, C, %i, %j, %k, %M, %N, %K)
 778                   with #inset_ijk(%i, %j, %k) [%M, %N, %K]
 779             } // end mld4im t6
 780           } // end mldim t5
 781         } // end mldim t4
 782       } // end mldim t3
 783       // (%i, %j) = affine.apply (d0, d1) -> (128*d0, 128*d1) (%t1, %t2)
 784       call $dma_scratchpad_to_mem_C ... with #intset(%i, %j) [%M, %N, %K]
 785     }  // end mldim t2
 786   } // end mldim t1
 787   return
 788 }
 789
 790 ```
 791
 792 ### Affine Relations
 793
 794 The current MLIR spec includes affine maps and integer sets, but not affine
 795 relations. Affine relations are a natural way to model read and write access
 796 information, which can be very useful to capture the behavior of external
 797 library calls where no implementation is available, high-performance vendor
 798 libraries, or user-provided / user-tuned routines.
 799
 800 An affine relation is a relation between input and output dimension identifiers
 801 while being symbolic on a list of symbolic identifiers and with affine
 802 constraints on the identifiers.
 803
 804 Syntax:
 805
 806 ```
 807 // Affine relation definition at the top of file
 808 affine-rel-def ::= affine-rel-id `=` affine-relation-inline
 809
 810 affine-rel-id ::= `##` prefixed-id
 811
 812 affine-relation-inline ::=
 813        `(` input-dims `)` (`[` symbols `]`)? `->`
 814        `(` output-dims `)` :  affine-constraint-conjunction
 815
 816 input-dims ::= bare-id-list
 817 output-dims ::= bare-id-list
 818 symbols ::= bare-id-list
 819
 820 affine-rel ::= affine-rel-id | affine-relation-inline
 821
 822 // Usage
 823 affine-rel-spec ::= affine-rel dim-and-symbol-use-list
 824 ```
 825
 826 All identifiers appearing in input-dims, output-dims, and symbol-dims are
 827 pairwise distinct. All affine-constraint non-terminals in the above syntax are
 828 allowed to contain identifiers only from input-dims, output-dims, and
 829 symbol-dims.
 830
 831 Affine relations are used to model read, write, may_read, and may_write sets of
 832 functions in the IR. The output dimension identifiers correspond to the data
 833 dimensions.
 834
 835 Example:
 836
 837 ```mlir
 838 // read relation: two elements ( d0 <= r0 <= d0+1 )
 839 ##aff_rel9 = (d0) -> (r0) : r0 - d0 >= 0, d0 - r0 + 1 >= 0
 840
 841 func.func @count (%A : memref<128xf32>, %pos : i32) -> f32
 842   reads: {%A ##aff_rel9 (%pos)}
 843   writes: /* empty */
 844   may_reads: /* empty */
 845   may_writes: /* empty */ {
 846 bb0 (%0, %1: memref<128xf32>, i64):
 847   %val = affine.load %A [%pos]
 848   %val = affine.load %A [%pos + 1]
 849   %p = arith.mulf %val, %val : f32
 850   return %p : f32
 851 }
 852 ```
 853
 854 ### Regions
 855
 856 #### Making function definition an operation
 857
 858 MLIR supports values of a Function type. Instead of having first-class IR
 859 concept for functions, one could define an operation with a body region that
 860 defines a function value. The particularity of functions is that their names are
 861 globally visible and can be referred to before being defined, unlike SSA values
 862 that must be defined first. Implementing a "function definition" operation would
 863 require to relax some of the SSA constraints in a region, and also make the IR
 864 Module a region as well. It would also affect the core infrastructure (e.g.,
 865 function passes) only for the sake of concept unification.
 866
 867 #### Having types on a region
 868
 869 Instead of inspecting the types of arguments of the first block, one could give
 870 the region itself a type. This type would be redundant with block argument
 871 types, which must have values and create room for type mismatches. While
 872 functions do have types that are partly redundant with the arguments of the
 873 first block in the function, this is necessary to support function declarations
 874 that do not have a body which we can refer to in order to obtain the argument
 875 types. A region is always contained in an operation or a function that can be
 876 queried to obtain the “type” of the region if necessary.
 877
 878 A type on a region can be justified if Regions were to be considered separately
 879 from the enclosing entity (operation or function) and had their own semantics
 880 that should be checked.
 881
 882 #### Attaching attributes to regions
 883
 884 Regions could be annotated with dialect attributes to use attribute verification
 885 hooks. An operation could take multiple regions as arguments, and each of them
 886 may require different attributes. However, there are currently very few
 887 practical cases where this would be necessary. Instead, one could simulate
 888 per-region attributes with array attributes attached to the entity containing
 889 the region (operation or function). This decreases the overall complexity of the
 890 IR and enables more concise and op-specific forms, e.g., when all regions of an
 891 op have the same attribute that can be only mentioned once. Since the semantics
 892 of the region is entirely defined by the enclosing entity, it also makes sense
 893 to have attributes attached to that entity rather than to the region itself.
 894
 895 This can be reconsidered in the future if we see a non-neglectable amount of use
 896 cases.
 897
 898 ### Read/Write/May_Read/May_Write sets for External Functions
 899
 900 Having read, write, may_read, and may_write sets for external functions which
 901 include opaque ones, high-performance vendor libraries such as CuDNN, CuB, MKL,
 902 FFT libraries, user-provided/optimized functions, or data movement runtimes such
 903 as DMA ones is a powerful feature. It allows the compiler to perform analysis,
 904 composition/transformation in the presence of such calls and with loops around
 905 such calls on sub-tensors. For user-provided or custom hand-tuned functions, the
 906 read/write/may_read/may_write sets could be provided a-priori by a user as part
 907 of the external function signature or they could be part of a database.
 908
 909 TODO: Design this, and update to use function attribute syntax.
 910
 911 Example:
 912
 913 ```mlir
 914 ##rel9 ( ) [s0] -> (r0, r1) : 0 <= r0 <= 1023, 0 <= r1 <= s0 - 1
 915
 916 func.func @cblas_reduce_ffi(%M: memref<1024 x ? x f32, #layout_map0, /*mem=*/0>)
 917   -> f32 [
 918   reads: {%M, ##rel9() }
 919   writes: /* empty */
 920   may_reads: /* empty */
 921   may_writes: /* empty */
 922 ]
 923
 924 func.func @dma_mem_to_scratchpad(%a : memref<1024 x f32, #layout_map0, /*mem=*/0>,
 925     %b : memref<1024 x f32, #layout_map0, 1>, %c : memref<1024 x f32,
 926     #layout_map0>) [
 927   reads: {%M, ##rel9() }
 928   writes: /* empty */
 929   may_reads: /* empty */
 930   may_writes: /* empty */
 931  ]
 932
 933 ```
 934
 935 ### Memref Extensions
 936
 937 1.  Arbitrary polyhedral shapes for tensors: e.g., triangular shapes in tensor
 938     dimensions where there is symmetry: use integer set (affine constraints) to
 939     model tensor data space (instead of just extents). Requires some changes to
 940     the IR and the in-memory form.
 941 1.  Layout maps
 942
 943     1.  Allow piece-wise affine maps for layouts: allows clean modeling of
 944         boundary cases for images/tensors through padding, wrapping, mirroring,
 945         padding where padded values are the results of computation as opposed to
 946         data, padding in the interior as opposed to just boundaries.
 947     1.  Allow many-to-one layout maps: Index and layout maps in the current
 948         proposal are bijective. Extending them to many-to-one layout maps allows
 949         cleaner(?) modeling of broadcast/reduce style computations while reusing
 950         memory.
 951
 952     Proposal 2(a) requires non-trivial changes to the IR and the in-memory
 953     representation. 2(b) requires no change, but impacts how cost models look at
 954     index and layout maps.
 955
 956 ### `affine.if` and `affine.for` Extensions for "Escaping Scalars"
 957
 958 We considered providing a representation for SSA values that are live out of
 959 `if/else` conditional bodies and loop carried in `affine.for` loops. We
 960 ultimately abandoned this approach due to its complexity. In the current design
 961 of MLIR, scalar variables cannot escape for loops or if instructions. In
 962 situations, where escaping is necessary, we use zero-dimensional tensors and
 963 memrefs instead of scalars.
 964
 965 **TODO**: This whole section is obsolete and should be updated to use block
 966 arguments and a yield like terminator in for/if instructions.
 967
 968 The abandoned design of supporting escaping scalars is as follows:
 969
 970 #### affine.for Instruction
 971
 972 Syntax:
 973
 974 ```
 975 [<out-var-list> =]
 976 for %<index-variable-name> = <lower-bound> ... <upper-bound> step <step>
 977    [with <in-var-list>] { <loop-instruction-list> }
 978 ```
 979
 980 out-var-list is a comma separated list of SSA values defined in the loop body
 981 and used outside the loop body. in-var-list is a comma separated list of SSA
 982 values used inside the loop body and their initializers. loop-instruction-list
 983 is a list of instructions that may also include a yield instruction.
 984
 985 Example:
 986
 987 ```mlir
 988 // Return sum of elements in 1-dimensional mref A
 989 func.func i32 @sum(%A : memref<?xi32>, %N : i32) -> (i32) {
 990    %init = 0
 991    %result = affine.for %i = 0 to N with %tmp(%init) {
 992       %value = affine.load %A[%i]
 993       %sum = %value + %tmp
 994       yield %sum
 995    }
 996    return %result : i32
 997 }
 998 ```
 999
1000 #### affine.if/else Instruction
1001
1002 Syntax:
1003
1004 ```
1005 <out-var-list> = affine.if (<cond-list>) {...} [else {...}]
1006 ```
1007
1008 Out-var-list is a list of SSA values defined by the if-instruction. The values
1009 are arguments to the yield-instruction that occurs in both then and else clauses
1010 when else clause is present. When if instruction contains only if clause, the
1011 escaping value defined in the then clause should be merged with the value the
1012 variable had before the if instruction. The design captured here does not handle
1013 this situation.
1014
1015 Example:
1016
1017 ```mlir
1018 // Compute sum of half of the array
1019 func.func i32 @sum_half(%A : memref<?xi32>, %N : i32) -> (i32) {
1020    %s0 = 0
1021    %s1 = affine.for %i = 1 ... N step 1 with %s2 (%s0) {
1022        %s3 = if (%i >= %N / 2) {
1023           %v0 = affine.load %A[%i]
1024           %s4 = %s2 + %v0
1025           yield %s4
1026        }
1027        yield %s3
1028    }
1029    return %s1 : i32
1030 }
1031 ```
1032
1033 ### Multithreading the compiler
1034
1035 People want compilers to go fast, and one simple way to do that is to
1036 multi-thread them. There are multiple strategies for this, but a simple one is
1037 to optimize and compile separate functions in parallel. LLVM's original pass
1038 manager anticipated this demand, and the CallGraphSCCPass manager is even
1039 designed to support this as well, but unfortunately, a few early design
1040 decisions in LLVM prevent this from ever happening. Instead, things like ThinLTO
1041 are forced to split programs into separate LLVM modules/context and optimize
1042 those chunks independently.
1043
1044 The problem is that LLVM has several objects in its IR that are globally uniqued
1045 and also mutable: notably constants like `i32 0`. In LLVM, these constants are
1046 `Value`'s, which allow them to be used as operands to instructions, and that
1047 they also have SSA use lists. Because these things are uniqued, every `i32 0` in
1048 any function shares a use list. This means that optimizing multiple functions in
1049 parallel won't work (at least without some sort of synchronization on the use
1050 lists, which would be unbearably inefficient).
1051
1052 MLIR now supports a multithreaded pass manager. We do this through several
1053 design choices:
1054
1055 1.  MLIR makes use of extensive uniqued immutable data structures (affine
1056     expressions, types, etc are all immutable, uniqued, and immortal).
1057 2.  Constants are defined in per-operation pools, instead of being globally
1058     uniqued.
1059 3.  Functions, and other global-like operations, themselves are not SSA values
1060     either, so they don't have the same problem as constants.
1061 4.  Passes are copied (through their copy ctor) into one instance per
1062     thread, avoiding sharing of local state across threads.
1063
1064 This allows MLIR passes to support efficient multithreaded compilation
1065 and code generation.