# Generic DAG Rewriter Infrastructure Rationale
This document details the rationale behind a general DAG-to-DAG rewrite
infrastructure for MLIR. For up-to-date documentation on the user-facing API,
please look at the main [Pattern Rewriting document](../PatternRewriter.md).
## Introduction and Motivation
The goal of a compiler IR is to represent code - at various levels of
abstraction which pose different sets of tradeoffs in terms of representational
capabilities and ease of transformation. However, the ability to represent code
is not itself very useful - you also need to be able to implement those
transformations.
There are many different types of compiler transformations, but this document
focuses on a particularly important class of transformation that comes up
repeatedly at scale, and is important for the goals of MLIR: matching one DAG of
operations and replacing it with another. This is an integral part of many
compilers and necessary for peephole optimizations like "eliminate identity
nodes" or "replace x+0 with x", a generalized canonicalization framework (e.g.
the Instruction Combiner in LLVM), as well as a useful abstraction for
implementing optimization algorithms for IR at multiple levels.
A particular strength of MLIR (and a major difference vs. other compiler
infrastructures like LLVM, GCC, XLA, TensorFlow, etc.) is that it uses a single
compiler IR to represent code at multiple levels of abstraction: an MLIR
operation can be a "TensorFlow operation", an "XLA HLO", an affine loop nest, an
LLVM IR instruction (transitively including X86, Lanai, PTX, and other
target-specific instructions), or anything else that the MLIR operation system
can reasonably express. Given that MLIR spans such a wide range of different
problem scopes, a single infrastructure for performing graph-to-graph rewrites
can help solve many diverse domain challenges.
[Static single assignment](https://en.wikipedia.org/wiki/Static_single_assignment_form)
(SSA) representations like MLIR make it easy to access the operands and "users"
of an operation. As such, a natural abstraction for these graph-to-graph
rewrites is that of DAG pattern matching: clients define DAG tile patterns
(where a tile is a sequence of operations defining a subgraph of the DAG), and
each pattern includes a result DAG to produce and the cost of the result (or,
inversely, the benefit of doing the replacement). A common infrastructure
efficiently finds and performs the rewrites.
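For illustration only (this is not MLIR's actual API), the tile/benefit
abstraction can be sketched as a pattern interface plus a helper that picks the
highest-benefit match at a given root. The names `Op`, `Pattern`,
`AddZeroPattern`, and `bestMatch` are all invented for this sketch:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, heavily simplified IR node: a named operation with operand links.
struct Op {
  std::string name;
  std::vector<Op *> operands;
};

// A pattern matches a DAG tile rooted at one operation and reports a benefit;
// the infrastructure applies the highest-benefit match at each root.
struct Pattern {
  virtual ~Pattern() = default;
  virtual bool match(const Op &root) const = 0;
  virtual int benefit() const = 0;  // e.g. number of ops the rewrite removes
};

// "x + 0 --> x": matches an add whose second operand is the constant zero.
struct AddZeroPattern : Pattern {
  bool match(const Op &root) const override {
    return root.name == "add" && root.operands.size() == 2 &&
           root.operands[1]->name == "const0";
  }
  int benefit() const override { return 1; }
};

// Pick the best applicable pattern for a root, or null if none match.
const Pattern *bestMatch(const std::vector<std::unique_ptr<Pattern>> &patterns,
                         const Op &root) {
  const Pattern *best = nullptr;
  for (const auto &p : patterns)
    if (p->match(root) && (!best || p->benefit() > best->benefit()))
      best = p.get();
  return best;
}
```

The key design point this sketch captures is that benefit selection lives in
shared code, not inside each pattern.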
While this concept is simple, the details are more nuanced. This document
defines and explores a set of abstractions that can solve a wide range of
different problems, and be applied to the many different sorts of problems that
MLIR is - and is expected to - face over time. We do this by separating the
pattern application algorithm from the "driver" of the computation loop, and by
making space for the patterns to be defined declaratively.
### Constant folding

A degenerate but pervasive case of DAG-to-DAG pattern matching is constant
folding: an operation whose operands contain constants can often be folded to a
result constant value.
MLIR operations may override a
[`fold`](../Canonicalization.md/#canonicalizing-with-the-fold-method) routine,
which exposes a simpler API compared to a general DAG-to-DAG pattern matcher and
allows it to be applicable in cases where a generic matcher would not be. For
example, a DAG rewrite can remove arbitrary nodes in the current function, which
could invalidate iterators. Constant folding as an API does not remove any
nodes: it just provides a (list of) constant values and allows the clients to
update their data structures as necessary.
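The contrast can be sketched as follows. This is a hypothetical hook, not
MLIR's real `fold` signature: the fold-style API computes a constant result
from the operands and hands it back, leaving the IR untouched so no iterators
are invalidated. The `Value` and `foldAdd` names are invented here:

```cpp
#include <optional>
#include <vector>

// Hypothetical simplified SSA value that may carry a known constant.
struct Value {
  std::optional<int> constant;  // set if this value is a known constant
};

// A fold-style hook for an "add": return the constant result if all inputs
// are constant, otherwise report failure. Nothing is removed or rewritten;
// the caller decides how to use the folded value.
std::optional<int> foldAdd(const std::vector<Value> &operands) {
  if (operands.size() != 2 || !operands[0].constant || !operands[1].constant)
    return std::nullopt;
  return *operands[0].constant + *operands[1].constant;
}
```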
## Related Work

There is a huge amount of related work to consider, given that nearly every
compiler in existence has to solve this problem many times over. One unifying
problem is that all of these systems are designed to solve one particular, and
usually narrow, problem; MLIR on the other hand would like to solve many of
these problems within a single infrastructure. Here are a few related graph
rewrite systems, along with the pros and cons of their work. (The most similar
design to the infrastructure present in MLIR is the LLVM DAG-to-DAG instruction
selection infrastructure.)
### AST-Level Pattern Matchers
The literature is full of source-to-source translators which transform
identities in order to improve performance (e.g. transforming `X*0` into `0`).
One large example is the GCC `fold` function, which performs
[many optimizations](https://github.com/gcc-mirror/gcc/blob/master/gcc/fold-const.c)
within it. Clang has
[similar routines](https://clang.llvm.org/docs/InternalsManual.html#constant-folding-in-the-clang-ast)
for simple constant folding of expressions (as required by the C++ standard) but
doesn't perform general optimizations on its ASTs.
The primary downside of AST optimizers is that you can't see across operations
that have multiple uses. It is
[well known in the literature](https://llvm.org/pubs/2008-06-LCTES-ISelUsingSSAGraphs.pdf)
that DAG pattern matching is more powerful than tree pattern matching, but on
the other hand, DAG pattern matching can lead to duplication of computation,
which needs to be checked for.
### "Combiners" and other peephole optimizers
Compilers end up with a lot of peephole optimizers for various things, e.g. the
GCC
["combine" routines](https://github.com/gcc-mirror/gcc/blob/master/gcc/combine.c)
(which try to merge two machine instructions into a single one), the LLVM
[Inst Combine](https://github.com/llvm/llvm-project/tree/main/llvm/lib/Transforms/InstCombine)
[pass](https://llvm.org/docs/Passes.html#instcombine-combine-redundant-instructions),
the LLVM
[DAG Combiner](https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/DAGCombiner.cpp),
the Swift compiler's
[SIL Combiner](https://github.com/apple/swift/tree/main/lib/SILOptimizer/SILCombiner),
etc. These generally match one or more operations and produce zero or more
operations as a result. The LLVM
[Legalization](https://github.com/llvm/llvm-project/tree/main/llvm/lib/CodeGen/SelectionDAG)
infrastructure has a different outer loop but otherwise works the same way.
These passes have a lot of diversity, but also have a unifying structure: they
mostly have a worklist outer loop which visits operations. They then use a
visitor pattern (or equivalent) to switch over the class of operation and
dispatch to a method. That method contains a long list of hand-written C++ code
that pattern-matches various special cases. LLVM introduced a "match" function
that allows writing patterns in a somewhat more declarative style using template
metaprogramming (MLIR has similar facilities). Here's a simple example:
```
// Y - (X + 1) --> ~X + Y
if (match(Op1, m_OneUse(m_Add(m_Value(X), m_One()))))
  return BinaryOperator::CreateAdd(Builder.CreateNot(X), Op0);
```
Here is a somewhat more complicated one (this is not the biggest or most
complicated example):

```
// LHS = XOR(Y, C1), Y = AND(Z, C2), C1 == (C2 + 1) => LHS == NEG(OR(Z, ~C2))
// ADD(LHS, RHS) == SUB(RHS, OR(Z, ~C2))
if (match(LHS, m_Xor(m_Value(Y), m_APInt(C1))))
  if (C1->countTrailingZeros() == 0)
    if (match(Y, m_And(m_Value(Z), m_APInt(C2))) && *C1 == (*C2 + 1)) {
      Value *NewOr = Builder.CreateOr(Z, ~(*C2));
      return Builder.CreateSub(RHS, NewOr, "sub");
    }
```
These systems are simple to set up, and pattern matching templates have some
advantages (they are extensible for new sorts of sub-patterns, and look compact
at the point of use). On the other hand, they have lots of well-known problems,
for example:
* These patterns are very error prone to write, and contain lots of
  boilerplate.
* The IR being matched often has identities (e.g. when matching commutative
  operators) and the C++ code has to handle it manually - take a look at
  [the full code](https://github.com/llvm/llvm-project/blob/c0b5000bd848303320c03f80fbf84d71e74518c9/llvm/lib/Transforms/InstCombine/InstCombineAddSub.cpp#L767)
  for `checkForNegativeOperand` that defines the second pattern.
* The matching code compiles slowly, both because it generates tons of code
  and because the templates instantiate slowly.
* Adding new patterns (e.g. for count leading zeros in the example above) is
  awkward and doesn't often happen.
* The cost model for these patterns is not really defined - it is emergent
  based on the order in which the patterns are matched in the code.
* They are non-extensible without rebuilding the compiler.
* It isn't practical to apply theorem provers and other tools to these
  patterns - they cannot be reused for other purposes.
In addition to structured "combiners" like these, there are lots of ad-hoc
systems like the
[LLVM Machine code peephole optimizer](http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/CodeGen/PeepholeOptimizer.cpp?view=markup).
### LLVM's DAG-to-DAG Instruction Selection Infrastructure
The instruction selection subsystem in LLVM is the result of many years' worth
of iteration and discovery, driven by the need for LLVM to support code
generation for lots of targets, the complexity of code generators for modern
instruction sets (e.g. X86), and the fanatical pursuit of reusing code across
targets. Eli Bendersky wrote a
[nice short overview](https://eli.thegreenplace.net/2013/02/25/a-deeper-look-into-the-llvm-code-generator-part-1)
of how this works, and the
[LLVM documentation](https://llvm.org/docs/CodeGenerator.html#select-instructions-from-dag)
describes it in more depth, including its advantages and limitations. It allows
writing patterns like this:
```
def : Pat<(or GR64:$src, (not (add GR64:$src, 1))),
          (BLCI64rr GR64:$src)>;
```
This example defines a matcher for the
["blci" instruction](https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets#TBM_\(Trailing_Bit_Manipulation\))
in the
[X86 target description](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86InstrInfo.td);
there are many others in that file (look for `Pat<>` patterns, since they aren't
entangled in details of the compiler like assembler/disassembler generation
logic).
For the purposes of MLIR, there is much to like about this system, for example:

* It is defined in a declarative format.
* It is extensible to target-defined operations.
* It automates matching across identities, like commutative patterns.
* It allows custom abstractions and intense factoring of target-specific
  knowledge.
* It generates compact code - it compiles into a state machine, which is
  interpreted.
* It allows the instruction patterns to be defined and reused for multiple
  purposes.
* The patterns are "type checked" at compile time, detecting lots of bugs
  early and eliminating redundancy from the pattern specifications.
* It allows the use of general C++ code for weird/complex cases.
While there is a lot that is good here, there are also a few undesirable bits:

* The representation is specifically designed for, and only applicable to,
  instruction selection, meaning that directly adjacent problems like the
  DAGCombiner and Legalizer can't use it.
* This isn't extensible at compiler runtime; you have to rebuild the compiler
  to extend it.
* The error messages when failing to match a pattern
  [are not exactly optimal](https://www.google.com/search?q=llvm+cannot+select).
* It has lots of implementation problems and limitations (e.g. can't write a
  pattern for a multi-result operation) as a result of working with the
  awkward SelectionDAG representation and being designed and implemented on
  top of it.
* Organic growth over time has left lots of sharp edges.
### Summary

MLIR faces a wide range of pattern matching and graph rewrite problems, and one
of the major advantages of having a common representation for code at multiple
levels is that it allows for investing in - and highly leveraging - a single
infrastructure for doing this sort of work.
## Goals of the Infrastructure

We'd like the infrastructure to encompass many problems in the MLIR space,
including 1-to-N expansions (e.g. as in type legalization during instruction
selection, when an add of one bit width may be split into multiple adds of a
smaller bit width), M-to-1 patterns (e.g. when converting a multiply+add into a
single muladd operation), as well as general M-to-N patterns (e.g. instruction
selection for target instructions). Patterns have a benefit associated with
them, and the common infrastructure should be responsible for sorting out the
highest-benefit match for a given application.
We separate the task of picking a particular optimal pattern for a given root
node from the algorithm used to rewrite an entire graph given a particular set
of goals, and from the definition of the patterns themselves. We do this because
DAG tile pattern matching is NP-complete. Additionally, we would like to support
iterative rewrite algorithms that progressively transform the input program
through multiple steps. Furthermore, we would like to support many different
sorts of clients across the MLIR stack, and they may have different tolerances
for compile-time cost, different demands for optimality, and other algorithmic
goals or constraints.
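The iterative, driver-separated style above can be sketched with a heavily
simplified worklist loop. This is a toy model, not MLIR's actual driver:
operations are plain strings and rewrites are callbacks, purely for
illustration; `runToFixpoint` is an invented name:

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <vector>

// A rewrite callback: mutates the op in place and returns true if it fired.
using Rewrite = std::function<bool(std::string &op)>;

// Hypothetical iterative driver: pop ops off a worklist, try each rewrite in
// turn, and re-enqueue any op a rewrite changed so later patterns can match
// the new form, until no pattern applies anywhere (a fixpoint).
int runToFixpoint(std::vector<std::string> &ops,
                  const std::vector<Rewrite> &rewrites) {
  std::deque<size_t> worklist;
  for (size_t i = 0; i < ops.size(); ++i) worklist.push_back(i);
  int applied = 0;
  while (!worklist.empty()) {
    size_t i = worklist.front();
    worklist.pop_front();
    for (const auto &r : rewrites)
      if (r(ops[i])) {           // a rewrite changed this op...
        ++applied;
        worklist.push_back(i);   // ...so revisit it for further matches
        break;
      }
  }
  return applied;
}
```

Note how the loop knows nothing about what the rewrites do; the driver policy
(here: naive fixpoint iteration) and the patterns vary independently, which is
the separation the text argues for.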
We aim for MLIR transformations to be easy to implement and to reduce the
likelihood of compiler bugs. We expect there to be a very large number of
patterns that are defined over time, and we believe that these sorts of patterns
will have a very large number of legality/validity constraints - many of which
are difficult to reason about in a consistent way, may be target specific, and
whose implementation may be particularly bug-prone. As such, we aim to design
the pattern-definition API to be simple, resilient to programmer errors, and to
allow separation of concerns between the legality of the nodes generated and the
idea of the pattern being defined.
Finally, error handling is a topmost concern: we want pattern match failures to
be diagnosable in a reasonable way. This is a difficult problem in general, as
the space of malfunction is too great to be fully enumerated and handled
optimally, but MLIR is already designed to represent the provenance of an
operation well. The aim of the pattern rewriting infrastructure is simply to
propagate that provenance information precisely, as well as to diagnose pattern
match failures with the rationale for why a set of patterns does not apply.
## Non goals

The pattern infrastructure does not aim to solve all compiler problems; it is
simply a DAG-to-DAG pattern matching system. Compiler algorithms that require
global dataflow analysis (e.g. common subexpression elimination, conditional
constant propagation, and many, many others) will not be directly solved by this
infrastructure.
This infrastructure is limited to DAG patterns, which (by definition) prevent
the patterns from seeing across cycles in a graph. In an SSA-based IR like MLIR,
this means that these patterns don't see across basic block arguments. We
consider this acceptable given the set of problems we are trying to solve - we
don't know of any other system that attempts to do so, and consider the payoff
of worrying about this to be low.
This design includes the ability for DAG patterns to have associated benefits,
but those benefits are defined in terms of magic numbers (typically equal to the
number of nodes being replaced). For any given application, the units of magic
numbers will have to be defined.