# MLIR: Incremental Application to Graph Algorithms in ML Frameworks
The existing documentation about MLIR focuses on long term vision, how its
pieces fit together, and the benefits of modular and composable infrastructure
in the vast and distant future. While this viewpoint appeals to some, it causes
concern for others who are more concerned about the "here and now" - why does it
make sense to make a "revolutionary" change when any individual problem can be
fixed in place?

This document explains that adoption of MLIR to solve graph based problems
*isn't* a revolutionary change: it is an incremental series of steps which build
on each other, each of which delivers local value. This document also addresses
some points of confusion that keep coming up.
One note: even though a major advantage of MLIR is that it can span the full
spectrum from graph algorithms down to low-level code generation, this document
focuses on the use of MLIR for **graph-level algorithms**. MLIR will also unlock
exciting code generation opportunities (particularly given its novel approach to
integrating state of the art polyhedral techniques), but issues that touch on
MLIR's relationship to XLA, Eigen, etc, are out of scope for this particular
document.

This document uses TensorFlow as the example given that it is the focus of our
immediate work, but we believe that the same viewpoint could be useful for
people working in the context of other ML frameworks that may consider adopting
MLIR.

### How is MLIR relevant?
MLIR is an overloaded acronym which unpacks as "Multi-Level Intermediate
Representation". Its high-level purpose is to provide mechanics for describing
and transforming programs and computations in a flexible way. It provides common
compiler infrastructure for things like constant folding, dead code elimination,
graph rewriting, and others - which are independent of the representational
choices picked by a given dialect (e.g. its concurrency semantics). It was built
with a specific focus on compile time and memory efficiency, accurate
propagation of source location information (important for reporting high quality
errors and warnings) and is designed for testability.
TensorFlow has numerous subsystems (some of which are proprietary, e.g.
Tensor-RT, nGraph, CoreML, etc) as well as translation layers between these
different subsystems, and these translation layers face similar challenges. (As
an aside, the internals of each of these subsystems could often benefit from
MLIR infrastructure, but that isn't a focus of this doc.)
A key observation that MLIR makes is that these subsystems often have two things
going on: they are both particular data structures and encodings (e.g. HLO
graphs, TF-Lite's flat buffer format, TensorFlow's Graph format, the ONNX
abstraction, etc) as well as an abstraction of computation (a specific way of
modeling a convolution, a set of supported operations, etc).
MLIR uses a standard IR (i.e., a set of data structures) for representing these
computations - this allows a huge amount of shared infrastructure across these
problem domains. MLIR then allows the definition of domain-specific "dialects"
that describe the set of operations that are legal and supported for a given
application. This means that the actual translations between data structures are
kept as simple as possible - and are thus relatively easy to make "correct".
This allows the common compiler infrastructure to handle the mapping problems
and the other issues within the domain.
MLIR's design is directly informed by the experience of building (and then
living with) intermediate representations like the LLVM IR, LLVM SelectionDAG,
the LLVM machine instruction representation, and Swift SIL IR, and it
incorporates new lessons from TensorFlow and XLA HLO, as well as from building
countless research and production systems on top of them. Our goal is to drag
the state of the art in compilers forward, not to merely apply a few well-known
techniques to the machine learning domain.
### What does adoption mean?

The point of this document is not to advocate for rewriting any particular
subsystem in TensorFlow - indeed, the burden required to justify a rewrite is
high, and often very specific to that subsystem. That said, there are several
subsystems that are about to get rewritten or substantially revised anyway, so
we use those as examples to concretely describe the benefits that MLIR provides
in these cases and what it will take. The subsystems discussed are:
1.  the TF Lite TOCO translator, which needs improved error
    reporting/reliability and generalization to support more ops, and
1.  the TF/XLA bridge, which needs to improve usability by merging some of its
    usage models, support dynamic shapes, and generalize guest subsystem support
    to Tensor-RT and nGraph.
1.  Grappler is another subsystem that is likely to get substantial revisions in
    the future, and would definitely benefit from the MLIR framework, but there
    are no known plans to do that work at this point, so we don't discuss it
    further in this document.
Adopting MLIR for each of these works the same way - and, in fact, the work to
support TF Lite is mostly a subset of the larger work to support the
functionality of the TF/XLA bridge. TF Lite and the TF/XLA bridge include
several compiler passes (things like encapsulate, functionalize control flow,
lowering of ops, fusion, constant folding, shape inference, etc).
MLIR supports converting from TensorFlow Graphs to MLIR and back, which means
that we can start by putting in a no-op translation to MLIR and back into the
pipeline, and verify that nothing breaks. Then we can work on replacing the
compiler transformations one by one by reimplementing them (with the improved
algorithms that we're planning).
This is a development plan: we wouldn't actually ship a TensorFlow that just
uses MLIR for a single pass. In practice, we'll keep the MLIR path gated behind
an option, build out a replacement for an entire subsystem (e.g. the TOCO
translator), and when the time is right, we'll do A/B comparisons and eventually
make a switch and phase out the old code over time.
## What benefit does MLIR provide?

The adoption plan above might sound like it only makes things worse in the
immediate term - we have two implementations of the same functionality, we are
dividing our efforts, etc. In order for this to be worth it, we should have a
good sense that we are building towards an improved future that will make
customers and TensorFlow engineers happier when it lands. Here we describe a few
of the benefits that MLIR provides, in no particular order:
### A Lossless Human Editable Textual Representation

The MLIR in-memory data structure has a human readable and writable format, as
well as [a specification](../LangRef.md) for that format - built just like any
other programming language. Important properties of this format are that it is
compact, easy to read, and lossless. You can dump an MLIR program out to disk
and munge around with it, then send it through a few more passes.

If you haven't worked with a system that works this way, it is hard to overstate
how big of a deal this is in practice: it means that you can call `foo->dump()`
on an IR object to see its full contents, it means you can diff the IR before
and after a change, delta reduce IR files, and many other things.
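As a concrete illustration (a hand-written sketch, not taken from the TensorFlow
dialect), here is what a small function looks like in the textual form - every
value, operation, and type is spelled out, so the file can be dumped,
hand-edited, and fed back into the tools:

```mlir
// A tiny function in MLIR's textual form. Names like %arg0 and %sum refer
// to SSA values, and the type of every operation is explicit.
func.func @add_one(%arg0: i32) -> i32 {
  %c1 = arith.constant 1 : i32
  %sum = arith.addi %arg0, %c1 : i32
  return %sum : i32
}
```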
### A Graph Verification Pass

Like many other popular compiler infrastructures, MLIR provides infrastructure
and implementation for a "verifier" which checks that the IR is well formed. The
MLIR verifier is a simple framework that makes it easy to provide a single
source of truth for those correctness properties and is general across all
Dialects (e.g. TF Graph, TF Lite flat buffer, XLA HLO, etc).

A verifier pass is sort of like a 'super assertion' that catches mistakes in
program transformations early, making you as an engineer more productive, making
the product more reliable, and making it easier to track down bugs when they
appear - because the verifier can be run at any time, either as a compiler pass
or with a single function call.

While MLIR provides a well-considered infrastructure for IR verification, and
has simple checks for existing TensorFlow operations, there is a lot that should
be added here and lots of opportunity to get involved!
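For instance (an illustrative sketch), structural rules like "a return must
match the enclosing function's signature" are enforced by the verifier, so
malformed IR like the following is rejected before any pass ever sees it:

```mlir
// This fails verification: the function promises to return i32, but the
// return operand is an f32 value.
func.func @bad_return() -> i32 {
  %0 = arith.constant 1.0 : f32
  return %0 : f32
}
```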
### Designed for Testability

There are many aspects of this in MLIR, but we'll focus on compiler
transformations since they are the easiest to understand. Compiler
transformations are modeled as subclasses of the `Pass` C++ class, which are
driven by an `mlir-opt` tool. When combined with a lossless textual
representation, it becomes really easy to write unit tests for compiler
transformations. For example, this is a simple test that shows `x-x` is being
turned into zero:

```mlir
// RUN: mlir-opt %s -canonicalize | FileCheck %s
func.func @test_subi_zero_cfg(%arg0: i32) -> i32 {
  %y = arith.subi %arg0, %arg0 : i32
  return %y: i32
}
// CHECK-LABEL: func @test_subi_zero_cfg(%arg0: i32)
// CHECK-NEXT: %c0_i32 = arith.constant 0 : i32
// CHECK-NEXT: return %c0
```
The "CHECK" comments are interpreted by the
[LLVM FileCheck tool](https://llvm.org/docs/CommandGuide/FileCheck.html), which
is sort of like a really advanced grep. This test is fully self-contained: it
feeds the input into the [canonicalize pass](../Canonicalization.md), and checks
that the output matches the CHECK lines. See the `test/Transforms` directory for
more examples. In contrast, standard unit testing exposes the API of the
underlying framework to lots and lots of tests (making it harder to refactor and
move the API), typically requires a lot more code, and exacerbates issues with
link time. For examples, see
[the TEST_F functions in TensorFlow's testsuite](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/grappler/optimizers/arithmetic_optimizer_test.cc).

MLIR has been pervasively designed around this sort of design-for-testability,
allowing us to put in place a culture that expects every behavior changing
commit to include a test case, and for these test cases to be stable and
reliable over time, since they are testing exactly what they are supposed to.
End to end integration tests are still super useful for some things of course!
### Infrastructure for Warnings and Error Diagnostics and Location Tracking

MLIR benefits from the lessons learned from building other compilers - including
Clang, which
[set the standard](http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html)
for quality of implementation in C/C++ compiler diagnostics. Drawing from this
experience (and fixing mistakes in LLVM), MLIR requires that operations and
functions carry abstract location information, that transformations propagate
this information, and provides standardized mechanisms to emit errors and
warnings, as well as for clients to hook into them to capture and report them in
custom ways.

Why is this important? In practice, many graph-to-graph translators can fail
(e.g. TF Lite when an unsupported op is used) and it is important to be able to
report the error up through to the user in the most precise way possible, in
order for it to be actionable. This includes tracking rewrites through fusions
and fissions of ops, mapping back into language / API specific domains, etc.
More selfishly for infrastructure hackers, this is a huge boon because it means
that it is easy to write good tests for this: the testing tools for MLIR capture
the diagnostics produced by passes (using the standard diagnostic hooks) and
check that they match the expected diagnostics in the testcase. For example, to
test the dependence analysis infra in the code generator, Andy Davis wrote a
simple pass that checks dependencies and emits them as "notes", allowing him to
write tests like this:

```mlir
// RUN: mlir-opt %s -memref-dependence-check -verify-diagnostics
func.func @different_memrefs() {
  %m.a = memref.alloc() : memref<100xf32>
  %m.b = memref.alloc() : memref<100xf32>
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1.0 : f32
  memref.store %c1, %m.a[%c0] : memref<100xf32>
  // expected-note@-1 {{dependence from memref access 0 to access 1 = false}}
  %v0 = memref.load %m.b[%c0] : memref<100xf32>
  return
}
```
Note that a major limitation of this is that MLIR suffers from a problem of
"garbage in, garbage out": if the input locations to MLIR are imprecise, then
there is nothing that it can do to recover them. There is work underway in
TensorFlow/Python to improve the situation, and Swift for TensorFlow already has
perfect location tracking due to its design.
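To make this concrete (a hand-written sketch with a made-up file name),
locations appear directly in the textual IR via `loc(...)` attachments, which is
what diagnostics point back to:

```mlir
// Each operation carries a source location; a failing pass can report
// "model.py:42:8: error: ..." instead of pointing at an anonymous graph node.
func.func @relu_layer(%arg0: tensor<4xf32>) -> tensor<4xf32> {
  %0 = "tf.Relu"(%arg0) : (tensor<4xf32>) -> tensor<4xf32> loc("model.py":42:8)
  return %0 : tensor<4xf32> loc("model.py":43:3)
}
```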
### Shape Information Captured in the IR

In TensorFlow Graphs, each op takes and returns values using a very simple type
system (TF_DataType) in which each value is a tensor of unknown rank and
dimensions. At the same time, many graphs have static shapes easily knowable for
wide swaths of the computation, and even dynamically shaped operations often
have statically knowable dimensions. Many analyses and transformations benefit
from and use this information when available, but because TensorFlow graphs
don't capture this (e.g. serialize it to proto), passes have to recompute it on
demand with ShapeRefiner.

The [MLIR Tensor Type](../Dialects/Builtin.md/#rankedtensortype) directly
captures shape information, so you can have things like:
```mlir
%x = tf.Add %x, %y : tensor<128 x 8 x ? x f32>
```
Capturing this in the IR is expected to speed up transformations (avoiding
recomputing the same info over and over again) which therefore makes it
practical to apply stronger shape analysis algorithms. It also makes it easier
to work with the IR, because on-the-side representations can get out of date,
and the API is easier to work with from an ergonomics perspective.
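The tensor type spans the whole spectrum from fully static to fully unknown
shapes. For example (pseudo-IR, where `...` stands in for whatever op produces
each value):

```mlir
// From most to least static knowledge:
%a = ... : tensor<128x8xf32>   // rank and all dimensions known
%b = ... : tensor<128x?xf32>   // rank known, one dimension dynamic
%c = ... : tensor<*xf32>       // rank unknown
```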
### Unified Graph Rewriting Infrastructure

This is still a work in progress, but we have sightlines towards a
[general rewriting infrastructure](RationaleGenericDAGRewriter.md) for
transforming DAG tiles into other DAG tiles, using a declarative pattern format.
DAG to DAG rewriting is a generalized solution for many common compiler
optimizations, lowerings, and other rewrites, and having a unified IR enables us
to invest in building a single high-quality implementation.

Declarative pattern rules are preferable to imperative C++ code for a number of
reasons: they are more compact, easier to reason about, can have checkers
written against them, and new tools can be built that inspect and manipulate the
declarative patterns in interesting ways - e.g. applying theorem provers to
them. It will be exciting to see this ecosystem develop as the infrastructure
matures.
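As a sketch of what such a declarative pattern could look like (using MLIR's
TableGen-based DRR syntax, but with hypothetical dialect op names - this is not
a real dialect):

```tablegen
// Rewrite "x - x" into a zero value of the same type. MyDialect_SubOp and
// MyDialect_ZeroLikeOp are placeholder names for illustration only.
def FoldSelfSubtract : Pat<(MyDialect_SubOp $x, $x),
                           (MyDialect_ZeroLikeOp $x)>;
```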
### Clarified Semantics for TensorFlow Operations

One of the challenging things about working with TensorFlow is that there are
many invariants and behaviors that need to be preserved and known about when
working with Graphs, and these can be difficult to reason about and lead to
bugs. Things like 'dead values', Switch and Merge nodes, concurrency semantics,
nodes that execute even when passed a dead value, multiple device program
representation - etc... all add complexities that can make it challenging to
reason about whether a transformation or analysis is correct in general. Even
something as simple as constant folding or transforming integer `x-x` into `0`
is non-trivial because you need to consider control dependence edges.

One of our major goals for the TensorFlow dialect of MLIR is to sort out these
situations and upgrade existing TensorFlow graphs to semantics that are easier
to reason about. The solutions to these problems are all still being debated,
but those discussions have already yielded a lot of potential answers:
introducing a `tf_dead_or<x>` type for switch/merge, modeling of TF operations
using futures/async semantics, etc. None of these particular battles are
critical or important for MLIR to succeed (because of its "meta" nature, the
abstraction decisions of any given dialect are up for it to decide), but each
one that works out will make it easier to work with and transform TensorFlow
operations. We expect these issues to get nailed down in the next couple of
months when the MLIR effort moves beyond TF Lite / TOCO support. The discussions
that are happening now are super valuable and making progress.

### Ergonomics
A minor-in-theory, but important-in-practice point is that MLIR is designed to
make it easy, memory efficient, and less error prone to transform code than
other systems. `TensorFlow::Graph` has implementation issues where the same
information is stored redundantly in different places (which must be manually
kept up to date), has somewhat unusual representation of certain constructs
(e.g. the function library, which makes it very difficult to add or remove
functions, e.g. during interprocedural transformations), and stores information
in the graph that is used by the executor, but isn't necessary for program
transformation.

TensorFlow has made a lot of progress in this area over the years, and there are
lots of ideas about further improvements in the future. We are happy that MLIR
addresses these needs (making it much easier to implement correct program
transformations) today, and are committed to pushing hard to make it better.
### Compile Time Performance and Memory Use

MLIR has been designed to be memory and compile-time efficient in its algorithms
and data structures, using immutable and uniqued structures, low level
bit-packing, and other well-known techniques to avoid unnecessary heap
allocations, and allow simple and safe multithreaded optimization of MLIR
programs. There are other reasons to believe that the MLIR implementations of
common transformations will be more efficient than the Python and C++
`TensorFlow::Graph` implementations of the same things, given the current
implementation details of TensorFlow.

That said, this is very much a theory at this point. When the new
implementations of various subsystems are available, we will see what happens in
practice: there will be no reason to speculate - we can measure.
## Common Questions and Concerns

Here we address some frequently asked questions and concerns.
### Isn't MLIR a big dependency to take on?

We've heard that at least some people are concerned that MLIR is a "big"
dependency to take on, and could result in large code size. Here are some key
points to consider:

1.  The entire MLIR codebase is a pretty small C++ code base in absolute terms
    compared to what goes into a modern ML framework.
1.  Like LLVM, MLIR is designed as a set of libraries that clients can link in
    or ignore as they wish. For example, the transformations in MLIR are kept
    separate from the core IR abstractions, and dialect specific code (e.g.
    TensorFlow, TF-Lite, XLA, etc) is all independently selectable by the build
    system. Clients that don't care about XLA don't link in that code, whether
    they are a TF-Lite system or a client that is completely unrelated to
    TensorFlow.
1.  MLIR's only third party dependency is on LLVM, but it doesn't depend on LLVM
    IR or any other heavy dependency - it just depends on LLVM's support library
    which provides efficient hash tables and other
    [memory efficient data structures that the STL does not](http://llvm.org/docs/ProgrammersManual.html#picking-the-right-data-structure-for-a-task).
    There have been discussions about splitting this set of libraries out to its
    own subproject in LLVM that the LLVM IR project depends on. This would be
    great for MLIR as well as other LLVM subprojects.
1.  TensorFlow and many other frameworks already use LLVM - in that case, MLIR
    would not be pulling in an additional dependency at all.
### How does MLIR represent {control flow, concurrency, …} semantics in TensorFlow?

MLIR provides a dialect that is an isomorphic 1-1 mapping between TensorFlow
graphs and MLIR, as well as a pretty complete translator back and forth (the
only known gap is that a few TF_DataType enums aren't handled yet). MLIR is a
"Multi-Level IR", which allows it to represent code with different abstraction
levels, so the ability to faithfully represent TensorFlow code in a completely
backwards compatible way (even if there are some historical warts!) is critical.

In *addition* to the isomorphic mapping, we are actively working on efforts to
raise the abstraction level for working with TensorFlow graphs in MLIR. Doing so
would make it even easier to write TensorFlow transformations than it is today,
and would provide a path to migrating TF 1.x graphs forward into the TF 2.x
world. For example, because MLIR has an extensible type system, we can directly
model whether it is impossible for a Tensor value to be a "dead" value - similar
to the use of optional types in modern programming languages.
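Sketching what this might look like (hypothetical syntax - none of this is a
committed design), a Switch-like operation could return values whose type makes
the possibility of dead-ness explicit:

```mlir
// Hypothetical !tf.dead_or<...> type: passes that don't understand dead
// values can simply refuse to touch values of this type, while everything
// else in the program is statically known to be live.
%true_out, %false_out = "tf.Switch"(%data, %pred)
    : (tensor<f32>, tensor<i1>)
    -> (!tf.dead_or<tensor<f32>>, !tf.dead_or<tensor<f32>>)
```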
These discussions occasionally cause confusion because there are several issues
being mixed up into one:

*   What are the current semantics of TensorFlow graphs, and what invariants can
    we rely on?
*   What should the semantics be in TensorFlow 2.0?
*   What do programs rely on in practice, and if it is unfriendly, can we
    migrate it?
*   Can we find a way to make it so transforms don't have to worry about the
    complexities of Switch/Merge, by using higher level control flow
    representations? (tentative answer: yes)
*   How should MLIR represent async vs sync operations, what invariants are
    provided, and how does this dovetail with control flow?
*   When is it safe and beneficial to perform optimizations that might reduce
    parallelism?

All of these questions have a "conservative/safe fallback": we can continue
providing exactly the same abstractions that TensorFlow always has. That said,
we are trying hard to level-up the representation (taking advantage of the
"Multi-Level" part of MLIR) because doing so will make it much much easier to
write analyses and transformations than it currently is in TensorFlow.

### Non-Goals
It is important to point out things that MLIR does not aim to do. For example,
there is no runtime component to MLIR: the TensorFlow executor, the TF Lite
FlatBuffer interpreter, and other existing runtimes should be used as-is.

Another non-goal is a stable binary encoding: MLIR doesn't currently provide
one. We will certainly add this at some point, but existing formats should be
used for serialization and distribution in the meantime.