# MLIR: The case for a simplified polyhedral form

MLIR embraces polyhedral compiler techniques for their many advantages in
representing and transforming dense numerical kernels, but it uses a form that
differs significantly from other polyhedral frameworks.

**Disclaimer / Warning**

This document is a very early design proposal (which has since been accepted)
that explored the tradeoffs of using this simplified form vs. the traditional
polyhedral schedule list form. At some point, this document could be dusted off
and written as a proper academic paper, but until then, it is better to include
it in this rough form than not at all. Beware that this document uses archaic
syntax and should not be considered a canonical reference to modern MLIR.

This document discusses the general goals of the project, introduces context
and the two alternatives, then talks about the tradeoffs of these designs.
Written by

## General goals of an IR, and goals of mlfunc's specifically

Our currently planned representation for MLIR consists of two kinds of
functions: an LLVM-like "CFG Function" and an "ML Function": a function
represented in multidimensional loop form. The idea is that a CFG function is
capable of full generality for expressing arbitrary computation, but is awkward
for loop transformations. In contrast, mlfunc's are limited (e.g. to control
flow involving loop nests over affine spaces), but these limitations make them
much easier to transform and analyze, particularly for the set of computations
in a machine learning kernel.

The design of an intermediate representation is an optimization problem, which
makes intentional tradeoffs that aim to make certain kinds of compiler
transformations simple. After all, it is "possible" to do almost any
transformation on any IR: we could theoretically do loop transformations on
assembly language. On the other hand, such transformations would take too long
to write, would be fragile due to irrelevant changes, would be difficult to
maintain, and would be difficult to make target independent. Performing
transformations on the "right level" of IR makes it much easier to analyze and
transform code, and can make them faster by reducing the size of the IR and
eliminating possibilities that would otherwise have to be considered.

This is the reason we're interested in adding polyhedral techniques to an IR in
the first place: though our base "CFG function" representation is fully capable
of expressing any computation, it is "too" expressive. The limitations imposed
by polyhedral techniques (e.g. on affine loop bounds and array subscripts)
define a closed algebra that can represent an interesting range of
transformations and their compositions, and because of their simplicity, we can
perform (e.g.) dependence analysis more efficiently and more reliably.

This raises an important question that this document examines: given that we
are introducing a redundant and limited way to express code and
transformations, exactly what form is best for performing the analyses and
transformations we want?

We explore two different design points that are capable of expressing the same
class of affine loop computations, but which use different representational
forms. These forms trade off verbosity, ease of transformation, and ease of
analysis in interesting ways.

## Context: Traditional Polyhedral Form

We started by discussing a representation that uses the traditional polyhedral
schedule set + domain representation, e.g. consider C-like code like:

```c
void simple_example(...) {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
      float tmp = X[i,j]    // S1
      A[i,j] = tmp + 1      // S2
      B[i,j] = tmp * 42     // S3
    }
  }
}
```

The polyhedral representation doesn't care about the actual computation, so we
will abstract the three statements into S1/S2/S3 in the discussion below.
Originally, we planned to represent this with a classical form like (syntax
details are not important and probably slightly incorrect below):

```mlir
mlfunc @simple_example(... %N) {
  %tmp = call @S1(%X, %i, %j)
    domain: (0 <= %i < %N), (0 <= %j < %N)
    schedule: (i, j, 0)

  call @S2(%tmp, %A, %i, %j)
    domain: (0 <= %i < %N), (0 <= %j < %N)
    schedule: (i, j, 1)

  call @S3(%tmp, %B, %i, %j)
    domain: (0 <= %i < %N), (0 <= %j < %N)
    schedule: (i, j, 2)
}
```

In this design, an mlfunc is an unordered bag of instructions whose execution
order is fully controlled by their schedule.

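For concreteness (this formulation is standard, and is only implicit in the
example above), each statement instance is mapped to a multidimensional
timestamp, and instances execute in lexicographic order of their timestamps;
the trailing constant orders the statements within one `(i, j)` iteration:

```latex
% Schedules for the running example, one timestamp per statement instance:
S_1(i, j) \mapsto (i,\ j,\ 0) \qquad
S_2(i, j) \mapsto (i,\ j,\ 1) \qquad
S_3(i, j) \mapsto (i,\ j,\ 2)
% An instance executes before another iff its timestamp is
% lexicographically smaller.
```
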
However, we recently agreed that a more explicit schedule tree representation
is a better fit for our needs, because it exposes important structure that will
make analyses and optimizations more efficient, and also makes the scoping of
SSA values more explicit. This leads us to a representation along the lines of:

```mlir
mlfunc @simple_example(... %N) {
  for S1(d0), S2(d0), S3(d0) {
    for S1(d1), S2(d1), S3(d1) {

      %tmp = call @S1(%X, d0, d1)      ;; S1
        domain: (0 <= d0 < %N), (0 <= d1 < %N)

      call @S2(%tmp, %A, d0, d1)       ;; S2
        domain: (0 <= d0 < %N), (0 <= d1 < %N)

      call @S3(%tmp, %B, d0, d1)       ;; S3
        domain: (0 <= d0 < %N), (0 <= d1 < %N)
    }
  }
}
```

This change makes the nesting structure of the loops an explicit part of the
representation, and makes lexical ordering within a loop significant
(eliminating the constant 0/1/2 of schedules).

It isn't obvious in the example above, but the representation allows for some
interesting features, including the ability for instructions within a loop nest
to have non-equal domains, as in the following example, where the second
instruction ignores the outer 10 points of the loop's iteration space:

```mlir
mlfunc @reduced_domain_example(... %N) {
  for S1(d0), S2(d0) {
    for S1(d1), S2(d1) {

      %tmp = call @S1(%X, d0, d1)      ;; S1
        domain: (0 <= d0 < %N), (0 <= d1 < %N)

      call @S2(%tmp, %A, d0, d1)       ;; S2
        domain: (10 <= d0 < %N-10), (10 <= d1 < %N-10)
    }
  }
}
```

It also allows schedule remapping within the instruction, like this example
that introduces a diagonal skew through a simple change to the schedules of the
two instructions:

```mlir
mlfunc @skewed_domain_example(... %N) {
  for S1(d0), S2(d0+d1) {
    for S1(d0+d1), S2(d1) {

      %tmp = call @S1(%X, d0, d1)      ;; S1
        domain: (0 <= d0 < %N), (0 <= d1 < %N)

      call @S2(%tmp, %A, d0, d1)       ;; S2
        domain: (0 <= d0 < %N), (0 <= d1 < %N)
    }
  }
}
```

This form has great power, and the polyhedral code generator (which lowers from
an mlfunc to a cfgfunc representation) handles all of it, so passes that
introduce loop transformations don't have to explicitly manipulate the looping
structure.

## Proposal: Simplified Polyhedral Form

This document proposes and explores the idea of going one step further, moving
all of the domain and schedule information into the "schedule tree". In this
form, we would have a representation where all instructions inside of a given
for-loop are known to have the same domain, which is maintained by the loop. In
the simplified form, we also have an "if" instruction that takes an affine
condition.

Our simple example above would be represented as:

```mlir
mlfunc @simple_example(... %N) {
  affine.for %i = 0 ... %N step 1 {
    affine.for %j = 0 ... %N step 1 {
      // identity noop in this case, but can exist in general.
      %0,%1 = affine.apply #57(%i, %j)

      %tmp = call @S1(%X, %0, %1)

      call @S2(%tmp, %A, %0, %1)

      call @S3(%tmp, %B, %0, %1)
    }
  }
}
```

The example with the reduced domain would be represented with an `if`
instruction:

```mlir
mlfunc @reduced_domain_example(... %N) {
  affine.for %i = 0 ... %N step 1 {
    affine.for %j = 0 ... %N step 1 {
      // identity noop in this case, but can exist in general.
      %0,%1 = affine.apply #57(%i, %j)

      %tmp = call @S1(%X, %0, %1)

      if (10 <= %i < %N-10), (10 <= %j < %N-10) {

        %2,%3 = affine.apply(%i, %j)    // identity noop in this case

        call @S2(%tmp, %A, %2, %3)
      }
    }
  }
}
```

These IRs represent exactly the same information, and have a similar
information density. The 'traditional' form introduces an extra level of
abstraction (schedules and domains) that makes it easy to transform
instructions at the expense of making it difficult to reason about how those
instructions will come out after code generation. With the simplified form,
transformations have to do parts of code generation inline with their
transformation: instead of simply changing a schedule to **(i+j, j)** to get
skewing, you'd have to generate this code explicitly (potentially implemented
by making polyhedral codegen a library that transformations call into):

```mlir
mlfunc @skewed_domain_example(... %N) {
  affine.for %t1 = 0 ... 2*N-2 step 1 {
    affine.for %t2 = max(0, t1-N+1) ... min(N, t1) step 1 {
      (%i, %j) = (%t1-%t2, %t2)
    }
  }
}
```

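As a quick sanity check (worked out here, not part of the original document),
these bounds follow from inverting the skew mapping **(t1, t2) = (i+j, j)**.
The small Python sketch below reads the bounds as the inclusive ranges
`0 <= t1 <= 2*N-2` and `max(0, t1-N+1) <= t2 <= min(N-1, t1)` (the syntax above
is only approximate) and confirms that the skewed nest visits exactly the same
`(i, j)` points as the original rectangular nest:

```python
# Sanity check (illustrative, not from the original document): the skewed
# loops enumerate exactly the same (i, j) points as the rectangular nest,
# assuming the inclusive bounds 0 <= t1 <= 2*N - 2 and
# max(0, t1 - N + 1) <= t2 <= min(N - 1, t1).
N = 7  # arbitrary example size

rectangular = {(i, j) for i in range(N) for j in range(N)}

skewed = set()
for t1 in range(0, 2 * N - 1):                        # 0 .. 2*N-2 inclusive
    for t2 in range(max(0, t1 - N + 1), min(N - 1, t1) + 1):
        i, j = t1 - t2, t2                            # invert t1 = i+j, t2 = j
        skewed.add((i, j))

assert rectangular == skewed
```
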
## Evaluation

Both of these forms are capable of expressing the same class of computation:
multidimensional loop nests with affine loop bounds and affine memory
references. That said, they pose very different tradeoffs in other ways.

### Commonality: can express same computation

Both of these can express the same sorts of computation, i.e. kernels written
in one form are representable in the other form in all cases.

### Commonality: dependence analysis

These representations both use affine functions for data layout mapping and
access subscripts, and dependence analysis works the same way.

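For instance (worked out here for illustration, not spelled out in the
original), the flow of `%tmp` from S1 to S2 in the running example reduces to
the same affine system in either form, and yields a dependence distance of
(0, 0):

```latex
% Flow dependence of %tmp from S1 (producer, iteration (i_w, j_w))
% to S2 (consumer, iteration (i_r, j_r)):
\exists\, (i_w, j_w),\, (i_r, j_r):\quad
  0 \le i_w, i_r < N, \quad 0 \le j_w, j_r < N, \quad
  i_w = i_r, \quad j_w = j_r
% \Rightarrow dependence distance (i_r - i_w,\ j_r - j_w) = (0, 0)
```
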
### Commonality: difficulty of determining optimal transformation series

One major challenge in optimizing this sort of code for performance is choosing
the ordering and behavior of the various loop transformations that get applied.
There are non-local effects of every decision, and neither representation helps
solve this inherently hard problem.

### Commonality: compactness of IR

In the cases that are most relevant to us (hyper-rectangular spaces) these
forms are directly equivalent: a traditional instruction with a limited domain
(e.g. the "reduced_domain_example" above) ends up having one level of ML 'if'
inside its loops. The simplified form makes up for this by eliminating
schedules and domains from the IR. Both forms allow code duplication to reduce
dynamic branches in the IR: the traditional approach allows instruction
splitting, the simplified form supports instruction duplication.

It is important to point out that the traditional form wins on compactness in
the extreme cases: e.g. the loop skewing case. These cases will be rare in
practice for our workloads, and they are exactly the cases where downstream
transformations want to be explicit about what they are doing.

### Simplicity of code generation

A key final stage of an mlfunc is its conversion to a CFG function, which is
required as part of lowering to the target machine. The simplified form has a
clear advantage here: the IR has a direct correspondence to the structure of
the generated code.

In contrast, the traditional form has significant complexity in the lowering
process to a CFG function, because the explicit structure that is left out of
the IR has to be recreated during code generation. Code generation from ISL
shows that it is possible to do this, but it is a non-trivial transformation.

### Ease of transformation

An advantage for the traditional form is that it is easier to perform certain
transformations on it: skewing and tiling are just transformations on the
schedules of the instructions in question; they don't require changing the loop
structure.

In practice, the simplified form requires moving the complexity of code
generation into the transformations themselves, which is sometimes trivial and
sometimes involved. The author believes that this should be possible by making
the code generation algorithms themselves be library functions that
transformations call into, instead of an opaque block that happens at the end
of the mlfunc processing.

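As a concrete illustration of that "codegen as a library" idea (a hypothetical
sketch; the helper name and interface are invented here, not an actual MLIR
API), a tiling pass working on the simplified form could call a small routine
like the following to compute the bounds of the nest it has to emit explicitly:

```python
# Hypothetical helper a tiling transformation might call into: given a loop
# over 0 <= i < n and a tile size, yield the bounds of the intra-tile loops of
# the equivalent two-deep nest that the pass must materialize in the IR.
def tiled_bounds(n: int, tile: int):
    for lo in range(0, n, tile):       # lower bound of each tile
        yield lo, min(n, lo + tile)    # [lo, hi) bounds of the intra-tile loop

# Not part of a real pass: just check that the tiled nest visits the same
# iterations, in the same order, as the original loop.
n, tile = 10, 4
original = list(range(n))
tiled = [i for lo, hi in tiled_bounds(n, tile) for i in range(lo, hi)]
assert original == tiled
```
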
Also, the sorts of transformations performed today by XLA (including tiling,
padding, unrolling, and other rectangular transformations) should be easy
enough to implement on either representation. The only cases that are a
challenge are more advanced cases like skewing, e.g. for DMA data movement
generation.

### Ease of analysis: Cost models

The simplified form is much easier for analyses and transformations to build
cost models for (e.g. answering the question of "how much code bloat will be
caused by unrolling a loop at this level?"), because it is easier to predict
what target code will be generated. With the traditional form, these analyses
will have to anticipate what polyhedral codegen will do to a set of
instructions under consideration: something that is non-trivial in the
interesting cases in question (see "Cost of code generation").

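To make that concrete, here is a small illustrative sketch (the types and
numbers are invented for this example, not taken from MLIR) of how such a cost
model becomes a plain walk over the simplified form's loop tree, since the IR
that exists after unrolling is just the current body replicated:

```python
# Illustrative only: estimate static code growth from unrolling a loop in a
# simplified-form-style loop tree. A node is either a Loop or a leaf op name.
from dataclasses import dataclass, field

@dataclass
class Loop:
    trip_count: int
    body: list = field(default_factory=list)  # nested Loops and/or op names

def static_size(node) -> int:
    """Number of operations appearing in the IR (not dynamically executed)."""
    if isinstance(node, Loop):
        return 1 + sum(static_size(child) for child in node.body)
    return 1

def size_after_unroll(loop: Loop, factor: int) -> int:
    """Static size of `loop` once its body is replicated `factor` times."""
    return 1 + factor * sum(static_size(child) for child in loop.body)

inner = Loop(trip_count=128, body=["load", "mul", "add", "store"])
outer = Loop(trip_count=64, body=[inner])

print(static_size(outer))                                  # 6 ops in the nest today
print(size_after_unroll(inner, 4) - static_size(inner))    # unrolling by 4 adds 12 ops
```
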
### Cost of code generation

State of the art polyhedral code generation is
[expensive and complicated](https://lirias.kuleuven.be/bitstream/123456789/497238/1/toplas-astgen.pdf),
sometimes with exponential time complexity. We expect that most machine
learning workloads will be hyper-rectangular, and thus it should be easy to
specialize in important cases. That said, the traditional polyhedral
representation makes it very easy to introduce complicated and expensive
schedules, and provides no way to understand and project a cost model for
using them. All downstream clients of the IR need to be prepared to handle the
full generality of IR that may come to them.

The simplified form defines this away: the concepts in the IR remain simple,
and the code much more directly reflects the cost model for lowering to CFG
functions and machine code. This is expected to be very important in the late
stages of a code generator for an accelerator.

### SSA in ML Functions

We already agree that values defined in an mlfunc can include scalar values and
that they are defined based on traditional dominance. In the simplified form,
this is very simple: arguments and induction variables defined in for-loops are
live inside their lexical body, and linear series of instructions have the same
"top down" dominance relation that a basic block does.

In the traditional form though, this is not the case: it seems that a lot of
knowledge about how codegen will emit the code is necessary to determine if SSA
form is correct or not. For example, this is invalid code:

```mlir
  %tmp = call @S1(%X, %0, %1)
    domain: (10 <= %i < %N), (0 <= %j < %N)

  call @S2(%tmp, %A, %0, %1)
    domain: (0 <= %i < %N), (0 <= %j < %N)
```

This is invalid because `%tmp` isn't defined on some iterations of the `%i`
loop on which `@S2` reads it.

This matters because it makes the verifier more complicated, but more
significantly, it means that load promotion and other optimizations that
produce SSA form will need to be aware of this and be able to model what
codegen does.

An emergent property of this that we discussed recently is that PHI nodes in
mlfunc's (if we support them) will also have to have domains.

### Lack of redundancy in IR

The traditional form has multiple encodings for the same sorts of behavior: you
end up having bits on `affine.for` loops to specify whether codegen should use
"atomic/separate" policies, unroll loops, etc. Instructions can be split or can
generate multiple copies of themselves because of overlapping domains, etc.

This is a problem for analyses and cost models, because they each have to
reason about these additional forms in the IR.

### Suitability to purpose: lowering to machine code

One of the main drivers for this work is lowering to low-level accelerator
code, including two-dimensional vectorization, insertion of DMAs, and other
utilization of the matrix accelerator units. In the author's opinion, the extra
compactness of the traditional form is a negative for this purpose: reasoning
about the generated machine code requires understanding the mapping from mlfunc
to lowered code, which means that such reasoning must understand what code
generation will do.

In the simplified form, the effect of "code generation" is always obvious from
the IR itself, which should make it easier to perform vectorization to target
instructions and other analyses we need to perform.

## Third Alternative: two different levels of mlfunc

One hybrid alternative is to support both the traditional and simplified forms
as two different levels of mlfunc, used at different stages of the compilation
pipeline.

The stages could look like this, for example:

1.  Early performance transformations could be done on the traditional form.
1.  Partial code generation lowers to the simplified form.
1.  Target-specific lowering phases for tiling, vectorization, and other 2D
    transforms that don't benefit much from the traditional form could be run.
1.  Final codegen to a cfg func can be done when all of the instructions are
    replaced with ones valid on the target.

While this is possible, it isn't clear what would justify the complexity of
this approach. Unless there is a super compelling reason for this, it would be
nice to not do this. **Update:** we discussed this as a design team and agreed
that this wouldn't be a good way to go.