mlir/docs/Rationale/RationaleLinalgDialect.md

   1 # Linalg Dialect Rationale: The Case For Compiler-Friendly Custom Operations
   2
   3 [TOC]
   4
   5 ## Introduction<a name="introduction"></a>
   6
   7 ### Positioning
   8
   9 <img width="180" align="left" alt="MLIR Codegen Flow" src="https://user-images.githubusercontent.com/10148468/73613629-c5586580-45c5-11ea-94b7-074aeea94c7b.png">
  10
  11 This document describes the key design principles
  12 that led to the existing implementation of Linalg and aims at exposing
  13 the tradeoffs involved when building higher-level Intermediate
  14 Representations (IR) and Dialects to facilitate code
  15 generation. Consider the simplified schema describing codegen in MLIR.
  16 Linalg is designed to solve the High-level Hierarchical Optimization
  17 (HHO box) and to interoperate nicely within a
  18 *Mixture Of Expert Compilers* environment (i.e. the *CGSel* box).
  19 This work is inspired by a wealth of [prior art](#prior-art) in
  20 the field, from which it seeks to learn key lessons. This documentation
  21 and introspection effort also comes in the context of the proposal for a
  22 working group for discussing the [Development of high-level Tensor Compute
  23 Primitives dialect(s) and
  24 transformations](https://llvm.discourse.group/t/development-of-high-level-tensor-compute-primitives-dialect-s-and-transformations/388/3).
  25 We hope that the lessons from prior art, the design principles outlined in
  26 this doc and the architecture of Linalg can help inform the community on a
  27 path to defining these High-Level Tensor Compute Primitives.
  28
  29 ### Inception
  30
  31 Linalg started as a pragmatic dialect to bootstrap code generation in MLIR, by
  32 *defining away* complex code generation problems like precise dependence
  33 analysis or polyhedral code generation and by introducing the ability to call
  34 into fast library implementations when available. Linalg **defines ops and
  35 transformations declaratively**  and was originally restricted to ops with
  36 *linear-algebra like* semantics (`pointwise`, `matmul`, `conv`...). This
  37 approach enables building a high-level productivity-first codegen solution that
  38 leverages *both* compiler optimizations *and* efficient library implementations
  39 so as not to miss out on simple performance benefits. For example, if
  40 one's favorite HPC library or ISA has a `matmul` primitive running at 95% of
  41 the achievable peak performance, for operands stored in some memory, one should
  42 be able to **use the primitive** when possible *and* generate code otherwise.
  43
  44 However, as the design of Linalg co-evolved with the design of MLIR, it became
  45 apparent that it could extend to larger application domains than just machine
  46 learning on dense tensors.
  47
  48 The design and evolution of Linalg follow a *codegen-friendly* approach where
  49 the IR and the transformations evolve hand-in-hand.
  50 The key idea is that op semantics *declare* and transport information that is
  51 traditionally obtained by compiler analyses.
  52 This information captures the legality and applicability of transformations and
  53 is **not lost by lowering prematurely to loop or CFG form**. The key
  54 transformations are designed so as to **preserve this information** as long as
  55 necessary. For example, `linalg.matmul` remains `linalg.matmul` after tiling
  56 and fusion.
  57
  58 Furthermore, Linalg decouples transformation validity from profitability
  59 considerations and voluntarily leaves the latter aside in the first iteration
  60 (see the [suitability for search](#suitability_for_search) guiding principle).
  61
  62 The first incarnation of these ideas was presented as an example at the
  63 EuroLLVM 2019 developer's meeting as part of the
  64 [Linalg section](https://llvm.org/devmtg/2019-04/slides/Tutorial-AminiVasilacheZinenko-MLIR.pdf)
  65 of the first [MLIR Tutorial](https://www.youtube.com/watch?v=cyICUIZ56wQ).
  66
  67 ### Evolution
  68 Since the initial implementation, the design has evolved with, and partially
  69 driven the evolution of the core MLIR infrastructure to use
  70 [Regions](../LangRef.md/#regions),
  71 [OpInterfaces](../Interfaces.md),
  72 [ODS](../DefiningDialects/Operations.md) and
  73 [Declarative Rewrite Rules](../DeclarativeRewrites.md)
  74 among others. The approach adopted by Linalg was extended to become
  75 [StructuredOps abstractions](
  76 https://drive.google.com/drive/u/0/folders/1sRAsgsd8Bvpm_IxREmZf2agsGU2KvrK-),
  77 with Linalg becoming its incarnation on tensors and buffers.
  78 It is complemented by the
  79 [Vector dialect](../Dialects/Vector.md),
  80 which defines structured operations on vectors, following the same rationale and
  81 design principles as Linalg. (Vector dialect includes the higher-level
  82 operations on multi-dimensional vectors and abstracts away the lowering to
  83 single-dimensional vectors).
  84
  85 The Linalg dialect itself grew beyond linear algebra-like operations to become
  86 more expressive, in particular by providing an abstraction of a loop nest
  87 supporting parallelism, reductions and sliding windows around arbitrary MLIR
  88 [regions](../LangRef.md/#regions). It also has the
  89 potential of growing beyond *dense* linear-algebra to support richer data
  90 types, such as sparse and ragged tensors and buffers.
  91
  92 Linalg design remains open to evolution and cross-pollination with other
  93 dialects and approaches. It has been successfully used as the staging ground
  94 for code generation-related abstractions, spinning off the generalization of
  95 the following:
  96 - the `!linalg.view` type folded into the *"Strided MemRef"* type while
  97 preserving structure to allow calling into external C++ libraries with
  98 unsurprising ABI conventions;
  99 - the `linalg.view` and `linalg.subview` ops evolved into the standard dialect;
 100 - the `linalg.for`, `linalg.load` and `linalg.store` ops evolved into a prelude
 101 to the *structured control flow* dialect (named `LoopOps`).
 102 More components can be extracted, redesigned and generalized when new uses or
 103 requirements arise.
 104
 105 Several [design questions](../Dialects/Linalg/_index.md/#open_issues) remain
 106 open in Linalg, which does not claim to be a general solution to all compilation
 107 problems. It does aim at driving thinking and implementations of domain-specific
 108 abstractions where programmer's intent can be captured at a very high level,
 109 directly in the IR.
 110
 111 Given the evolution of the scope, it becomes apparent that a better name than
 112 "Linalg" could remove some of the confusions related to the dialect (and the
 113 underlying approach), its goals and limitations.
 114
 115 ## Prior Art
 116 Linalg draws inspiration from decades of prior art to design a modern a
 117 pragmatic solution. The following non-exhaustive list refers to some of the
 118 projects that influenced Linalg design:
 119
 120 - [ONNX](https://onnx.ai/),
 121 - [LIFT](https://www.lift-project.org/),
 122 - [XLA](https://www.tensorflow.org/xla/architecture),
 123 - [Halide](https://halide-lang.org/) and [TVM](https://tvm.apache.org/),
 124 - [TACO](http://tensor-compiler.org/),
 125 - [Darkroom](http://darkroom-lang.org/) and [Terra](http://terralang.org/),
 126 - [Sigma-LL](http://spiral.ece.cmu.edu:8080/pub-spiral/pubfile/cgo16-preprint_248.pdf),
 127 - [Tensor Comprehensions](https://arxiv.org/abs/1802.04730),
 128 - [Polyhedral Compilers](https://en.wikipedia.org/wiki/Polytope_model),
 129 - the [Affine dialect](https://mlir.llvm.org/docs/Dialects/Affine/) in MLIR,
 130 - Generic Loop Transformations (see Ken Kennedy's
 131 [Optimizing Compilers for Modern Architectures](
 132 https://www.elsevier.com/books/optimizing-compilers-for-modern-architectures/allen/978-0-08-051324-9))
 133 - Traditional compiler CFGs with SSA forms.
 134
 135 Additionally, experience with the following tools proved very valuable when
 136 thinking holistically about how all these components interplay all the way
 137 up to the user and down to the hardware:
 138
 139 - the [Torch](http://torch.ch/) machine-learning framework,
 140 - the LLVM compiler, specifically in JIT mode,
 141 - high-performance libraries (MKL, CUBLAS, FBFFT)
 142 - the [PeachPy](https://www.cs.utexas.edu/users/flame/BLISRetreat/BLISRetreatTalks/PeachPy.pdf) assembler
 143 - current and potentially upcoming hardware ISAs.
 144
 145 The novelty of MLIR's code base and its unprecedented support for defining and
 146 mixing abstractions, enabling one to reflect on and integrate the key elements
 147 of the prior art success as well as avoid the common pitfalls in the area of
 148 code generation. Thus, instead of diverging into a discussion about the
 149 implications of adopting any of the existing solutions, Linalg had the
 150 possibility to build on all of them and learn from their experience while
 151 leveraging the benefit of hindsight.
 152
 153 The following reflections on prior art have influenced the design of Linalg.
 154 The discussion is by no means exhaustive but should capture the key motivations
 155 behind Linalg.
 156
 157 ### Lessons from ONNX<a name="lessonsonnx"></a>
 158 ONNX is a specification of operations that appear in Machine Learning
 159 workloads. As such, it is predominantly driven by the expressiveness requirements
 160 of ML, and less by the considerations of IR design for HPC code generation.
 161
 162 Similarly to ONNX, Linalg defines *"semantically charged" named ops*.
 163 But it also considers *transformations on these ops* as a key component and
 164 defines the IR to support the transformations, preferring transformations over
 165 expressiveness if necessary.
 166
 167 Linalg hopes to additionally address the following:
 168 - facilitate frontend-compiler co-design by taking into account compiler
 169   transformations and lowerings in op definition;
 170 - minimize the set of available ops by making them non-overlapping with each
 171   other, thus simplifying the intermediate representation.
 172
 173 ### Lessons from LIFT<a name="lessonslift"></a>
 174 [LIFT](https://www.lift-project.org/) is a system to write computational
 175 kernels based on functional abstractions. Transformations are
 176 represented by additional nodes in the IR, whose semantics are at the
 177 level of the algorithm (e.g. `partialReduce`).
 178 LIFT applies and composes transformations by using [local rewrite
 179 rules](https://www.lift-project.org/presentations/2015/ICFP-2015.pdf) that
 180 embed these additional nodes directly in the functional abstraction.
 181
 182 Similarly to LIFT, Linalg uses local rewrite rules implemented with the MLIR
 183 [Declarative Rewrite Rules](../DeclarativeRewrites.md)
 184 mechanisms.
 185
 186 Linalg builds on, and helps separate concerns in the LIFT approach as follows:
 187 - transformations are either separated from the representation or expressed as
 188   composable attributes that are independent of the actual computation,
 189   avoiding intricate effects on performance;
 190 - abstractions are split into smaller components (e.g., control flow and data
 191   structure abstractions) potentially reusable across different dialects in the
 192   MLIR's open ecosystem.
 193
 194 LIFT is expected to further influence the design of Linalg as it evolves. In
 195 particular, extending the data structure abstractions to support non-dense
 196 tensors can use the experience of LIFT abstractions for
 197 [sparse](https://www.lift-project.org/publications/2016/harries16sparse.pdf)
 198 and [position-dependent
 199 arrays](https://www.lift-project.org/publications/2019/pizzuti19positiondependentarrays.pdf).
 200
 201 ### Lessons from XLA<a name="lessonsxla"></a>
 202 [XLA](https://www.tensorflow.org/xla/architecture) is one of the first
 203 post-Theano ML compilers that was introduced as a pragmatic compilation
 204 solution for TensorFlow. It shines on Google's xPU
 205 hardware and is an important piece of the puzzle. It is particularly good at
 206 (1) transforming code back and forth between the scalar and the vector
 207 worlds, (2) passing function boundaries for handling both host and device
 208 code, and (3) complying to stringent requirements imposed by energy-efficient
 209 xPUs.
 210 XLA followed a pragmatic design process where the compiler is given perfect
 211 knowledge of each op's semantic, all starting from the mighty `conv` and
 212 `matmul` ops. XLA transformations consist of writing emitters that compose, as C++
 213 functions. Perfect op semantics knowledge has 2 big benefits: (1) transformations are
 214 correct by construction (2) very strong performance on difficult xPU targets.
 215
 216 Similarly, Linalg ops *"know their semantics"* and *"know how to transform and
 217 lower themselves"*. The means by which this information is made available and
 218 how it is used in MLIR are, however, very different.
 219
 220 Linalg hopes to additionally address the following:
 221 - HLOs are expressive as a whole, but each op has very limited and fixed
 222 semantics: ops are not configurable. As a consequence, HLOs have evolved into
 223 a too large set of ops whose semantics intersect.
 224 This echoes the ops proliferation problem also exhibited by ONNX.
 225 - Reliance on perfect op knowledge leads to situations where transformations and
 226 ops end up needing to know about each other's semantics (e.g. during fusion).
 227 Since the transformations themselves are not simple local rewrite patterns
 228 (unlike LIFT), code complexity grows quickly.
 229 - XLA lacks an independent IR that can be inspected, unit tested and used
 230 independently. This monolithic design makes the system not portable: xPU passes
 231 and GPU passes do not share much code.
 232
 233 ### Lessons from Halide and TVM<a name="lessonshalide"></a>
 234 [Halide](https://halide-lang.org/) is a DSL embedded in C++ that provides a
 235 way of metaprogramming the HalideIR and applying transformations declaratively
 236 to let the expert user transform and optimize the program in tailored ways.
 237 Halide, initially targeted the SIGGRAPH community but is now more generally
 238 applicable. [TVM](https://tvm.apache.org/) is an evolution of Halide into the
 239 machine learning and deep-neural network space, based on HalideIR.
 240
 241 The Halide transformation methodology follows similar principles to the
 242 [URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf)
 243 and
 244 [CHiLL](https://pdfs.semanticscholar.org/6a46/20589f63f3385707d2d590f7b7dc8ee4d74f.pdf)
 245 compiler transformation frameworks, but without the strengths (and especially
 246 complexity) of the polyhedral model.
 247
 248 Halide particularly shines at making the HPC transformation methodology
 249 accessible to $\Omega$(10-100) users, at a time when polyhedral tools are
 250 still only accessible to $\Omega$(1-10) users. Halide makes heavy usage of
 251 canonicalization rules that are also very prevalent in MLIR.
 252
 253 Linalg hopes to additionally address the following:
 254 - Halide scheduling is powerful and explores a large swath of possible
 255 transformations. But it's still too hard for newcomers to use or extend. The
 256 level of performance you get from Halide is very different depending on
 257 whether one is a seasoned veteran or a newcomer. This is especially true as
 258 the number of transformations grows.
 259 - Halide raises rather than lowers in two ways, going counter-current to the
 260 design goals we set for high-level codegen abstractions in MLIR. First,
 261 canonical Halide front-end code uses explicit indexing and math on scalar
 262 values, so to target BLAS/DNN libraries one needs to add pattern matching
 263 which is similarly brittle as in the affine case. While Halide's performance
 264 is on par with the libraries on programmable targets (CPU/GPU), that
 265 approach doesn't work on mobile accelerators or on xPUs, where the framework
 266 ingests whole-tensor operations.
 267 Second, reductions and scans are expressed using serial iteration, again
 268 requiring pattern matching before they can be transformed (e.g. to do a
 269 reduction using atomics, or hierarchically). The lesson to draw is that we
 270 should start with higher-level primitives than Halide.
 271
 272 ### Lessons from Tensor Comprehensions<a name="lessonstc"></a>
 273 [Tensor Comprehensions](https://arxiv.org/abs/1802.04730) is a
 274 high-level language to express tensor computations with a syntax
 275 generalizing the Einstein notation, coupled to an end-to-end
 276 compilation flow capable of lowering to efficient GPU code. It was
 277 integrated with 2 ML frameworks: Caffe2 and PyTorch.
 278
 279 <img width="600" alt="MLIR Codegen Flow"
 280 src="https://user-images.githubusercontent.com/10148468/73613272-df904480-45c1-11ea-88f9-214dee7464cf.png">
 281
 282 The compilation flow combines [Halide](#lessonshalide) and a Polyhedral Compiler
 283 derived from [ISL](https://en.wikipedia.org/wiki/Integer_set_library)
 284 and uses both HalideIR and the ISL *schedule-tree* IR.
 285 The compiler provides a collection of polyhedral compilation
 286 algorithms to perform fusion and favor multi-level parallelism and
 287 promotion to deeper levels of the memory hierarchy.
 288 Tensor Comprehensions showed that, fixing a few predefined strategies
 289 with parametric transformations and tuning knobs, can already provide
 290 great results. In that previous work, simple
 291 genetic search combined with an autotuning framework was sufficient
 292 to find good implementations in the ***non-compute bound regime***.
 293 This requires code versions obtainable by the
 294 various transformations to encompass versions that get close to the
 295 roofline limit.
 296 The ultimate goal of Tensor Comprehensions was to concretely mix
 297 Halide high-level transformations with polyhedral mid-level
 298 transformations and build a pragmatic system that could take advantage
 299 of both styles of compilation.
 300
 301 Linalg hopes to additionally address the following:
 302 - Halide was never properly used in Tensor Comprehensions beyond shape
 303 inference. Most of the investment went into simplifying polyhedral
 304 transformations and building a usable end-to-end system. MLIR was
 305 deemed a better infrastructure to mix these types of compilation.
 306 - The early gains provided by reusing established infrastructures
 307 (HalideIR and ISL schedule trees) turned into more impedance mismatch
 308 problems than could be solved with a small tactical investment.
 309 - Tensor Comprehensions emitted CUDA code which was then JIT compiled
 310 with NVCC from a textual representation. While this was a pragmatic
 311 short-term solution it made it hard to perform low-level rewrites that
 312 would have helped with register reuse in the ***compute-bound regime***.
 313 - The same reliance on emitting CUDA code made it difficult to
 314 create cost models when time came. This made it artificially harder to
 315 prune out bad solutions than necessary. This resulted in excessive
 316 runtime evaluation, as reported in the paper [Machine Learning Systems
 317 are Stuck in a Rut](https://dl.acm.org/doi/10.1145/3317550.3321441).
 318
 319 Many of those issues are naturally addressed by implementing these ideas
 320 in the MLIR infrastructure.
 321
 322 ### Lessons from Polyhedral compilers<a name="lessonspolyhedral"></a>
 323 The polyhedral model has been on the cutting edge of loop-level optimization for
 324 decades, with several incarnations in production compilers such as
 325 [GRAPHITE](https://gcc.gnu.org/wiki/Graphite) for GCC and
 326 [Polly](https://polly.llvm.org) for LLVM. Although it has proved crucial to
 327 generate efficient code from domain-specific languages such as
 328 [PolyMage](http://mcl.csa.iisc.ac.in/polymage.html) and [Tensor
 329 Comprehensions](https://dl.acm.org/doi/abs/10.1145/3355606), it has never been
 330 fully included into mainstream general-purpose optimization pipelines. Detailed
 331 analysis of the role of polyhedral transformations is provided in the
 332 [simplified polyhedral
 333 form](RationaleSimplifiedPolyhedralForm.md) document
 334 dating back to the inception of MLIR.
 335
 336 In particular, polyhedral abstractions have proved challenging to integrate with
 337 a more conventional compiler due to the following.
 338 - The transformed code (or IR) quickly gets complex and thus hard to analyze and
 339   understand.
 340 - Code generation from the mathematical form used in the polyhedral model relies
 341   on non-trivial exponentially complex algorithms.
 342 - The mathematical form is rarely composable with the SSA representation and
 343   related algorithms, on which most mainstream compilers are built today.
 344 - Expressiveness limitations, although addressed in the scientific literature
 345   through, e.g., summary functions, often remain present in actual
 346   implementations.
 347
 348 The Affine dialect in MLIR was specifically designed to address the integration
 349 problems mention above. In particular, it maintains the IR in the same form
 350 (loops with additional constraints on how the bounds are expressed) throughout
 351 the transformation, decreasing the need for one-shot conversion between
 352 drastically different representations. It also embeds the polyhedral
 353 representation into the SSA form by using MLIR regions and thus allows one to
 354 combine polyhedral and SSA-based transformations.
 355
 356 ### Lessons from the Affine dialect<a name="lessonsaffine"></a>
 357 The Affine dialect in MLIR brings the polyhedral abstraction closer to the
 358 conventional SSA representation. It addresses several long-standing integration
 359 challenges as described above and is likely to be more suitable when compiling
 360 from a C language-level abstraction.
 361
 362 MLIR makes it possible to start from a higher-level abstraction than C, for
 363 example in machine learning workloads. In such cases, it may be possible to
 364 avoid complex analyses (data-flow analysis across loop iterations is
 365 exponentially complex) required for polyhedral transformation by leveraging the
 366 information available at higher levels of abstractions, similarly to DSL
 367 compilers. Linalg intends to use this information when available and ensure
 368 *legality of transformations by construction*, by integrating legality
 369 preconditions in the op semantics (for example, loop tiling can be applied to
 370 the loop nest computing a matrix multiplication, no need to additionally rely on
 371 affine dependence analysis to check this). This information is not readily
 372 available in the Affine dialect, and can only be derived using potentially
 373 expensive pattern-matching algorithms.
 374
 375 Informed by the practical experience in polyhedral compilation and with the
 376 Affine dialects in particular, Linalg takes the following decisions.
 377 - **Discourage loop skewing**: the loop skewing transformation, that is
 378   sometimes used to enable parallelization, often has surprising (negative)
 379   effects on performance. In particular, polyhedral auto-transformation can be
 380   expressed in a simpler way without loop skewing; skewing often leads to
 381   complex control flow hampering performance on accelerators such as GPUs.
 382   Moreover, the problems loop skewing addresses can be better addressed by other
 383   approaches, e.g., diamond tiling. In the more restricted case of ML workloads,
 384   multi-for loops with induction variables independent of each other (referred
 385   to as hyper-rectangular iteration domains in the literature) such as the
 386   proposed
 387   [affine.parallel](https://llvm.discourse.group/t/rfc-add-affine-parallel/350)
 388   are sufficient in the majority of cases.
 389 - **Declarative Tiling**: the *tiling* transformation is ubiquitous in HPC code
 390   generation. It can be seen as a decomposition of either the iteration space or
 391   the data space into smaller regular parts, referred to as tiles. Polyhedral
 392   approaches, including the Affine dialect, mostly opt for iteration space
 393   tiling, which introduces additional control flow and complex address
 394   expressions. If the tile sizes are not known during the transformation (so
 395   called parametric tiling), the address expressions and conditions quickly
 396   become non-affine or require exponentially complex algorithms to reason about
 397   them. Linalg focuses tiling on the data space instead, creating views into the
 398   buffers that leverage MLIR's strided `memref` abstraction. These views compose
 399   and the complexity of access expressions remains predictable.
 400 - **Preserve high-level information**: Linalg maintains the information provided
 401   by the op semantics as long as necessary for transformations. For example, the
 402   result of tiling a matrix multiplication is loops around a smaller matrix
 403   multiplication. Even with pattern-matching on top of the Affine dialect, this
 404   would have required another step of pattern-matching after the transformation.
 405
 406 Given these choices, Linalg intends to be a better fit for **high-level
 407 compilation** were significantly more information is readily available in the
 408 input representation and should be leveraged before lowering to other
 409 abstractions. Affine remains a strong abstraction for mid-level transformation
 410 and is used as a lowering target for Linalg, enabling further transformations
 411 and combination of semantically-loaded and lower-level inputs. As such, Linalg
 412 is intended to complement Affine rather than replace it.
 413
 414 ## Core Guiding Principles<a name="guiding_principles"></a>
 415
 416 ### Transformations and Simplicity First<a name="transformations_first"></a>
 417 The purpose of the Linalg IR and its operations is primarily to:
 418 - develop a set of key transformations, and
 419 - make them correct by construction by carefully curating the set of
 420 generic operation properties that drive applicability, and
 421 - make them very simple to implement, apply, verify and especially
 422 maintain.
 423
 424 The problem at hand is fundamentally driven by compilation of domain-specific
 425 workloads for high-performance and parallel hardware architectures: **this is
 426 an HPC compilation problem**.
 427
 428 The selection of relevant transformations follows a co-design approach and
 429 involves considerations related to:
 430 - concrete current and future needs of the application domain,
 431 - concrete current and future hardware properties and ISAs,
 432 - understanding of strengths and limitations of [existing approaches](#prior-art),
 433 - taking advantage of the coexistence of multiple levels of IR in MLIR,
 434
 435 One needs to be methodical to avoid proliferation and redundancy. A given
 436 transformation could exist at multiple levels of abstraction but **just
 437 because one can write transformation X at level Y absolutely does not mean
 438 one should**. This is where evaluation of existing
 439 systems and acknowledgement of their strengths and weaknesses is crucial:
 440 simplicity and maintainability aspects must be first-order concerns. Without
 441 this additional effort of introspection, a design will not stand the test of
 442 time. At the same time, complexity is very hard to ward off. It seems one needs
 443 to suffer complexity to be prompted to take a step back and rethink
 444 abstractions.
 445
 446 This is not merely a reimplementation of idea X in system Y: simplicity
 447 **must be the outcome** of this introspection effort.
 448
 449 ### Preservation of Information<a name="information_preservation"></a>
 450 The last two decades have seen a proliferation of Domain-Specific Languages
 451 (DSLs) that have been very successful at limited application domains.
 452 The main commonality between these systems is their use of a significantly
 453 richer structural information than CFGs or loops.
 454 Still, another commonality of existing systems is to lower to LLVM very quickly,
 455 and cross a wide abstraction gap in a single step. This process often drops
 456 semantic information that later needs to be reconstructed later,
 457 when it is not irremediably lost.
 458
 459 These remarks, coupled with MLIR's suitability for defining IR at multiple
 460 levels of abstraction led to the following 2 principles.
 461
 462 #### Declarative Specification: Avoid Raising<a name="declarative_specification"></a>
 463
 464 Compiler transformations need static structural information (e.g. loop-nests,
 465 graphs of basic blocks, pure functions, etc). When that structural information
 466 is lost, it needs to be reconstructed.
 467
 468 A good illustration of this phenomenon is the notion of *raising* in polyhedral
 469 compilers: multiple polyhedral tools start by raising from a simplified C
 470 form or from SSA IR into a higher-level representation that is more amenable
 471 to loop transformations.
 472
 473 In advanced polyhedral compilers, a second type of raising
 474 may typically exist to detect particular patterns (often variations of
 475 BLAS). Such patterns may be broken by transformations making their detection
 476 very fragile or even just impossible (incorrect).
 477
 478 MLIR makes it easy to define op semantics declaratively thanks to the use of
 479 regions and attributes. This is an ideal opportunity to define new abstractions
 480 to convey user-intent directly into the proper abstraction.
 481
 482 #### Progressive Lowering: Don't Lose Information too Quickly<a name="#progressive_lowering"></a>
 483
 484 Lowering too quickly to affine, generic loops or CFG form reduces the
 485 amount of structure available to derive transformations from. While
 486 manipulating loops is a net gain compared to CFG form for a certain class of
 487 transformations, important information is still lost (e.g. parallel loops, or
 488 mapping of a loop nest to an external implementation).
 489
 490 This creates non-trivial phase ordering issues. For instance, loop fusion may
 491 easily destroy the ability to detect a BLAS pattern. One possible alternative
 492 is to perform loop fusion, tiling, intra-tile loop distribution and then hope to
 493 detect the BLAS pattern. Such a scheme presents difficult phase-ordering
 494 constraints that will likely interfere with other decisions and passes.
 495 Instead, certain Linalg ops are designed to maintain high-level information
 496 across transformations such as tiling and fusion.
 497
 498 MLIR is designed as an infrastructure for ***progressive lowering***.
 499 Linalg fully embraces this notion and thinks of codegen in terms of
 500 *reducing a potential function*. That potential function is loosely
 501 defined in terms of number of low-level instructions in a particular
 502 Linalg ops (i.e. how heavy or lightweight the Linalg op is).
 503 Linalg-based codegen and transformations start from higher-level IR
 504 ops and dialects. Then each transformation application reduces the
 505 potential by introducing lower-level IR ops and *smaller* Linalg ops.
 506 This gradually reduces the potential, all the way to Loops + VectorOps
 507 and LLVMIR.
 508
 509 ### Composable and Declarative Transformations<a name="declarative_transformations"></a>
 510 Complex and impactful transformations need not be hard to manipulate, write or
 511 maintain. Mixing XLA-style high-level op semantics knowledge with generic
 512 properties to describe these semantics, directly in MLIR, is a promising way to:
 513 - Design transformations that are correct by construction, easy to
 514 write, easy to verify and easy to maintain.
 515 - Provide a way to specify transformations and the units of IR they manipulate
 516 declaratively. In turn this allows using local pattern rewrite rules in MLIR
 517 (i.e. [DRR](../DeclarativeRewrites.md)).
 518 - Allow creating customizable passes declaratively by simply selecting rewrite
 519 rules. This allows mixing transformations, canonicalizations, constant folding
 520 and other enabling rewrites in a single pass. The result is a system where pass
 521 fusion is very simple to obtain and gives hope for solving certain
 522 [phase ordering issues](https://dl.acm.org/doi/10.1145/201059.201061).
 523
 524 ### Suitability for Search and Machine Learning<a name="ml"></a>
 525 Compiler heuristics are hand-crafted human-engineered features: it is
 526 ripe for disruption by machine-learning  techniques.
 527 To enable search, compiler transformations should be fine-grained,
 528 [composable](#declarative_transformations) and expose tuning parameters that
 529 can modify their behavior, guided by lessons from previous experience
 530 with [Tensor Comprehensions](#lessonstc).
 531
 532 Of course, we are not advocating for using ML everywhere in the stack
 533 immediately: low-level compilation and machine models are still quite performant
 534 in LLVM. However, for the high-level and mid-level optimization problems,
 535 models need to be conditioned (probabilistically) on the low-level
 536 compiler which acts as a blackbox. For these reasons we prioritize the
 537 design of IR and transformations with search-friendly properties over
 538 building cost models.
 539 Still, this  does not mean Linalg refuses cost models: instead we
 540 prefer to invest in infrastructure that will enable [ML-based
 541 techniques to automatically build cost
 542 models](http://homepages.inf.ed.ac.uk/hleather/publications/2009_autofeatures_cgo.pdf).
 543
 544 ### Extensibility and Future-Proofness<a name="future"></a>
 545 MLIR allows defining IR for structured control flow and structured
 546 data types. We choose to take advantage of these properties for the
 547 reasons described above.
 548 In particular, the `MemRefType` represents dense non-contiguous memory regions.
 549 This structure should extend beyond simple dense data types and generalize to
 550 ragged, sparse and mixed dense/sparse tensors as well as to trees, hash tables,
 551 tables of records and maybe even graphs.
 552
 553 For such more advanced data types, the control-flow required to traverse the
 554 data structures, termination conditions, etc are much less simple to analyze and
 555 characterize statically. As a consequence we need to also design solutions that
 556 stand a chance of evolving into runtime-adaptive computations (e.g.
 557 inspector-executor in which an *inspector* runs a cheap runtime
 558 analysis on the data to configure how the *executor* should run).
 559 While there is no concrete solution
 560 today to solve these problems in MLIR, it is pretty clear that perfect
 561 static knowledge and analyses will not be serious contenders for these problems.
 562
 563 ## Key Observations<a name="keyobservation"></a>
 564 The following key observations have influenced the design of Linalg and helped
 565 reconcile [core guiding principles](#guiding_principles) with real-world
 566 requirements when producing an implementation based on MLIR.
 567
 568 ### Algorithms + Data Structures = Programs<a name="data_and_compute"></a>
 569
 570 This is a twist on Niklaus Wirth's formulation but captures the essence of the
 571 design of Linalg: control-flow does not exist in a vacuum, independently of
 572 data. On the contrary, there is a very strong relationship between control-flow
 573 and data structures: one cannot exist without the other. This has multiple
 574 implications on the
 575 [semantics of Linalg Ops](../Dialects/Linalg/_index.md/#linalg_ops) and their
 576 transformations. In particular, this observation influences whether certain
 577 transformations are better done: - as control flow or data structure
 578 manipulation, - on Linalg ops attributes or on loops after some partial lowering
 579 occurred, - as extensions to the Linalg dialect in terms of new ops or
 580 attributes.
 581
 582 ### The Dialect Need not be Closed Under Transformations<a name="dialect_not_closed"></a>
 583 This is probably the most surprising and counter-intuitive
 584 observation. When one designs IR for transformations, closed-ness is
 585 often a non-negotiable property.
 586 This is a key design principle of polyhedral IRs such as
 587 [URUK](http://icps.u-strasbg.fr/~bastoul/research/papers/GVBCPST06-IJPP.pdf)
 588 and
 589 [ISL-based IRs](https://en.wikipedia.org/wiki/Integer_set_library):
 590 they are closed under affine transformations.
 591 In MLIR, multiple dialects coexist and form a coherent whole. After
 592 experimenting with different alternatives, it became clear that strict
 593 dialect closed-ness wasn't necessary and could be relaxed. Previous
 594 systems did not have simple and principled means of building new IR
 595 and probably suffered from this limitation. We conjecture this is a
 596 key reason they required the IR to be closed under transformations.
 597
 598 Despite the fact that Linalg ops only allow perfectly nested
 599 semantics, once tiling and fusion kick in, imperfectly nested loops
 600 are gradually introduced.
 601 In other words, imperfectly nested control flow appears as ***the result of
 602 applying key transformations***.
 603
 604 Considering the *potential* described during the discussion on
 605 [Progressive Lowering](#progressive_lowering), closed-ness under
 606 transformation would dictate that the potential remains constant.
 607 In contrast, Linalg advocates for ***monotonicity*** under
 608 transformations.
 609
 610 ### Summary of Existing Alternatives a Picture<a name="observationssummary"></a>
 611 Lastly, we summarize our observations of lessons from [Prior
 612 Art](#prior-art)---when viewed under the lense of our [Core Guiding
 613 Principles](#guiding_principles)---with the following picture.
 614
 615 <img width="1200" alt="MLIR Codegen Flow"
 616 src="https://user-images.githubusercontent.com/10148468/73613904-2f720a00-45c8-11ea-8265-1c856c02525b.png">
 617
 618 This figure is not meant to be perfectly accurate but a rough map of how we view
 619 the distribution of structural information in existing systems, from a
 620 codegen-friendly angle. Unsurprisingly, the
 621 [Linalg Dialect](../Dialects/Linalg/_index.md) and its future evolutions aspire
 622 to a position in the top-right of this map.