mlir/docs/Dialects/Vector.md

   1 # 'vector' Dialect
   2
   3 [TOC]
   4
   5 MLIR supports multi-dimensional `vector` types and custom operations on those
   6 types. A generic, retargetable, higher-order `vector` type (`n-D` with `n > 1`)
   7 is a structured type, that carries semantic information useful for
   8 transformations. This document discusses retargetable abstractions that exist in
   9 MLIR today and operate on ssa-values of type `vector` along with pattern
  10 rewrites and lowerings that enable targeting specific instructions on concrete
  11 targets. These abstractions serve to separate concerns between operations on
  12 `memref` (a.k.a buffers) and operations on `vector` values. This is not a new
  13 proposal but rather a textual documentation of existing MLIR components along
  14 with a rationale.
  15
  16 ## Positioning in the Codegen Infrastructure
  17
  18 The following diagram, recently presented with the
  19 [StructuredOps abstractions](https://drive.google.com/corp/drive/u/0/folders/1sRAsgsd8Bvpm_IxREmZf2agsGU2KvrK-),
  20 captures the current codegen paths implemented in MLIR in the various existing
  21 lowering paths.
  22 ![](https://user-images.githubusercontent.com/10148468/71177417-f78e4d80-2239-11ea-92ef-700f42ea503f.png)
  23
  24 The following diagram seeks to isolate `vector` dialects from the complexity of
  25 the codegen paths and focus on the payload-carrying ops that operate on std and
  26 `vector` types. This diagram is not to be taken as set in stone and
  27 representative of what exists today but rather illustrates the layering of
  28 abstractions in MLIR.
  29
  30 ![`vector` Abstractions in MLIR](https://user-images.githubusercontent.com/10148468/71176949-e85ad000-2238-11ea-9806-200843bc4943.png)
  31
  32 This  separates concerns related to (a) defining efficient operations on
  33 `vector` types from (b) program analyses + transformations on `memref`, loops
  34 and other types of structured ops (be they `HLO`, `LHLO`, `Linalg` or other ).
  35 Looking a bit forward in time, we can put a stake in the ground and venture that
  36 the higher level of `vector`-level primitives we build and target from codegen
  37 (or some user/language level), the simpler our task will be, the more complex
  38 patterns can be expressed and the better performance will be.
  39
  40 ## Components of a Generic Retargetable Vector-Level Dialect
  41
  42 The existing MLIR `vector`-level dialects are related to the following bottom-up
  43 abstractions:
  44
  45 1.  Representation in `LLVMIR` via data structures, instructions and intrinsics.
  46     This is referred to as the `LLVM` level.
  47 2.  Set of machine-specific operations and types that are built to translate
  48     almost 1-1 with the HW ISA. This is referred to as the Hardware Vector
  49     level; a.k.a `HWV`. For instance, we have (a) the `NVVM` dialect (for
  50     `CUDA`) with tensor core ops, (b) accelerator-specific dialects (internal),
  51     a potential (future) `CPU` dialect to capture `LLVM` intrinsics more closely
  52     and other dialects for specific hardware. Ideally this should be
  53     auto-generated as much as possible from the `LLVM` level.
  54 3.  Set of virtual, machine-agnostic, operations that are informed by costs at
  55     the `HWV`-level. This is referred to as the Virtual Vector level; a.k.a
  56     `VV`. This is the level that higher-level abstractions (codegen, automatic
  57     vectorization, potential vector language, ...) targets.
  58
  59 The existing generic, retargetable, `vector`-level dialect is related to the
  60 following top-down rewrites and conversions:
  61
  62 1.  MLIR Rewrite Patterns applied by the MLIR `PatternRewrite` infrastructure to
  63     progressively lower to implementations that match closer and closer to the
  64     `HWV`. Some patterns are "in-dialect" `VV -> VV` and some are conversions
  65     `VV -> HWV`.
  66 2.  `Virtual Vector -> Hardware Vector` lowering is specified as a set of MLIR
  67     lowering patterns that are specified manually for now.
  68 3.  `Hardware Vector -> LLVM` lowering is a mechanical process that is written
  69     manually at the moment and that should be automated, following the `LLVM ->
  70     Hardware Vector` ops generation as closely as possible.
  71
  72 ## Short Description of the Existing Infrastructure
  73
  74 ### LLVM level
  75
  76 On CPU, the `n-D` `vector` type currently lowers to `!llvm<array<vector>>`.
  77 More concretely,
  78 * `vector<4x8x128xf32>` lowers to `!llvm<[4 x [ 8 x < 128
  79 x float >]]>` (fixed-width vector), and
  80 * `vector<4x8x[128]xf32>` lowers to `!llvm<[4 x [ 8 x < vscale x 128
  81 x float >]]>` (scalable vector).
  82
  83 There are tradeoffs involved related to how one can access subvectors and how
  84 one uses `llvm.extractelement`, `llvm.insertelement` and `llvm.shufflevector`.
  85 The section on [LLVM Lowering Tradeoffs](#llvm-lowering-tradeoffs) offers a
  86 deeper dive into the current design choices and tradeoffs.
  87
  88 Note, while LLVM supports arrarys of scalable vectors, these are required to be
  89 fixed-width arrays of 1-D scalable vectors. This means scalable vectors with a
  90 non-trailing scalable dimension (e.g. `vector<4x[8]x128xf32`) are not
  91 convertible to LLVM.
  92
  93 Finally, MLIR takes the same view on scalable Vectors as LLVM (c.f.
  94 [VectorType](https://llvm.org/docs/LangRef.html#vector-type)):
  95 > For scalable vectors, the total number of elements is a constant multiple
  96 > (called vscale) of the specified number of elements; vscale is a positive
  97 > integer that is unknown at compile time and the same hardware-dependent
  98 > constant for all scalable vectors at run time.  The size of a specific
  99 > scalable vector type is thus constant within IR, even if the exact size in
 100 > bytes cannot be determined until run time.
 101
 102 ### Hardware Vector Ops
 103
 104 Hardware Vector Ops are implemented as one dialect per target. For internal
 105 hardware, we are auto-generating the specific HW dialects. For `GPU`, the `NVVM`
 106 dialect adds operations such as `mma.sync`, `shfl` and tests. For `CPU` things
 107 are somewhat in-flight because the abstraction is close to `LLVMIR`. The jury is
 108 still out on  whether a generic `CPU` dialect is concretely needed, but it seems
 109 reasonable to have the same levels of abstraction for all targets and perform
 110 cost-based lowering decisions in MLIR even for `LLVM`. Specialized `CPU`
 111 dialects that would capture specific features not well captured by LLVM peephole
 112 optimizations of on different types that core MLIR supports (e.g. Scalable
 113 Vectors) are welcome future extensions.
 114
 115 ### Virtual Vector Ops
 116
 117 Some existing Arith and Vector Dialect on `n-D` `vector` types comprise:
 118
 119 ```mlir
 120 // Produces a vector<3x7x8xf32>
 121 %a = arith.addf %0, %1 : vector<3x7x8xf32>
 122 // Produces a vector<3x7x8xf32>
 123 %b = arith.mulf %0, %1 : vector<3x7x8xf32>
 124 // Produces a vector<3x7x8xf32>
 125 %c = vector.splat %1 : vector<3x7x8xf32>
 126
 127 %d = vector.extract %0[1]: vector<7x8xf32> from vector<3x7x8xf32>
 128 %e = vector.extract %0[1, 5]: vector<8xf32> from vector<3x7x8xf32>
 129 %f = vector.outerproduct %0, %1: vector<4xf32>, vector<8xf32>      // -> vector<4x8xf32>
 130 %g = vector.outerproduct %0, %1, %2: vector<4xf32>, vector<8xf32>  // fma when adding %2
 131
 132 // Returns a slice of type vector<2x2x16xf32>
 133 %h = vector.strided_slice %0
 134     {offsets = [2, 2], sizes = [2, 2], strides = [1, 1]}:
 135   vector<4x8x16xf32>
 136
 137 %i = vector.transfer_read %A[%0, %1]
 138     {permutation_map = (d0, d1) -> (d0)}:
 139   memref<7x?xf32>, vector<4xf32>
 140
 141 vector.transfer_write %f1, %A[%i0, %i1, %i2, %i3]
 142     {permutation_map = (d0, d1, d2, d3) -> (d3, d1, d0)} :
 143   vector<5x4x3xf32>, memref<?x?x?x?xf32>
 144 ```
 145
 146 The list of Vector is currently undergoing evolutions and is best kept track of
 147 by following the evolution of the
 148 [VectorOps.td](https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/Vector/IR/VectorOps.td)
 149 ODS file (markdown documentation is automatically generated locally when
 150 building and populates the
 151 [Vector doc](https://github.com/llvm/llvm-project/blob/main/mlir/docs/Dialects/Vector.md)).
 152 Recent extensions are driven by concrete use cases of interest. A notable such
 153 use case is the `vector.contract` op which applies principles of the
 154 StructuredOps abstraction to `vector` types.
 155
 156 ### Virtual Vector Rewrite Patterns
 157
 158 The following rewrite patterns exist at the `VV->VV` level:
 159
 160 1.  The now retired `MaterializeVector` pass used to legalize ops on a
 161     coarse-grained virtual `vector` to a finer-grained virtual `vector` by
 162     unrolling. This has been rewritten as a retargetable unroll-and-jam pattern
 163     on `vector` ops and `vector` types.
 164 2.  The lowering of `vector_transfer` ops legalizes `vector` load/store ops to
 165     permuted loops over scalar load/stores. This should evolve to loops over
 166     `vector` load/stores + `mask` operations as they become available `vector`
 167     ops at the `VV` level.
 168
 169 The general direction is to add more Virtual Vector level ops and implement more
 170 useful `VV -> VV` rewrites as composable patterns that the PatternRewrite
 171 infrastructure can apply iteratively.
 172
 173 ### Virtual Vector to Hardware Vector Lowering
 174
 175 For now, `VV -> HWV` are specified in C++ (see for instance the
 176 [SplatOpLowering for n-D vectors](https://github.com/tensorflow/mlir/commit/0a0c4867c6a6fcb0a2f17ef26a791c1d551fe33d)
 177 or the
 178 [VectorOuterProductOp lowering](https://github.com/tensorflow/mlir/commit/957b1ca9680b4aacabb3a480fbc4ebd2506334b8)).
 179
 180 Simple
 181 [conversion tests](https://github.com/llvm/llvm-project/blob/main/mlir/test/Conversion/VectorToLLVM/vector-to-llvm.mlir)
 182 are available for the `LLVM` target starting from the Virtual Vector Level.
 183
 184 ## Rationale
 185
 186 ### Hardware as `vector` Machines of Minimum Granularity
 187
 188 Higher-dimensional `vector`s are ubiquitous in modern HPC hardware. One way to
 189 think about Generic Retargetable `vector`-Level Dialect is that it operates on
 190 `vector` types that are multiples of a "good" `vector` size so the HW can
 191 efficiently implement a set of high-level primitives (e.g.
 192 `vector<8x8x8x16xf32>` when HW `vector` size is say `vector<4x8xf32>`).
 193
 194 Some notable `vector` sizes of interest include:
 195
 196 1.  CPU: `vector<HW_vector_size * k>`, `vector<core_count * k’ x
 197     HW_vector_size * k>` and `vector<socket_count x core_count * k’ x
 198     HW_vector_size * k>`
 199 2.  GPU: `vector<warp_size * k>`, `vector<warp_size * k x float4>` and
 200     `vector<warp_size * k x 4 x 4 x 4>` for tensor_core sizes,
 201 3.  Other accelerators: n-D `vector` as first-class citizens in the HW.
 202
 203 Depending on the target, ops on sizes that are not multiples of the HW `vector`
 204 size may either produce slow code (e.g. by going through `LLVM` legalization) or
 205 may not legalize at all (e.g. some unsupported accelerator X combination of ops
 206 and types).
 207
 208 ### Transformations Problems Avoided
 209
 210 A `vector<16x32x64xf32>` virtual `vector` is a coarse-grained type that can be
 211 “unrolled” to HW-specific sizes. The multi-dimensional unrolling factors are
 212 carried in the IR by the `vector` type. After unrolling, traditional
 213 instruction-level scheduling can be run.
 214
 215 The following key transformations (along with the supporting analyses and
 216 structural constraints) are completely avoided by operating on a `vector`
 217 `ssa-value` abstraction:
 218
 219 1.  Loop unroll and unroll-and-jam.
 220 2.  Loop and load-store restructuring for register reuse.
 221 3.  Load to store forwarding and Mem2reg.
 222 4.  Coarsening (raising) from finer-grained `vector` form.
 223
 224 Note that “unrolling” in the context of `vector`s corresponds to partial loop
 225 unroll-and-jam and not full unrolling. As a consequence this is expected to
 226 compose with SW pipelining where applicable and does not result in ICache blow
 227 up.
 228
 229 ### The Big Out-Of-Scope Piece: Automatic Vectorization
 230
 231 One important piece not discussed here is automatic vectorization (automatically
 232 raising from scalar to n-D `vector` ops and types). The TL;DR is that when the
 233 first "super-vectorization" prototype was implemented, MLIR was nowhere near as
 234 mature as it is today. As we continue building more abstractions in `VV -> HWV`,
 235 there is an opportunity to revisit vectorization in MLIR.
 236
 237 Since this topic touches on codegen abstractions, it is technically out of the
 238 scope of this survey document but there is a lot to discuss in light of
 239 structured op type representations and how a vectorization transformation can be
 240 reused across dialects. In particular, MLIR allows the definition of dialects at
 241 arbitrary levels of granularity and lends itself favorably to progressive
 242 lowering. The argument can be made that automatic vectorization on a loops + ops
 243 abstraction is akin to raising structural information that has been lost.
 244 Instead, it is possible to revisit vectorization as simple pattern rewrites,
 245 provided the IR is in a suitable form. For instance, vectorizing a
 246 `linalg.generic` op whose semantics match a `matmul` can be done
 247 [quite easily with a pattern](https://github.com/tensorflow/mlir/commit/bff722d6b59ab99b998f0c2b9fccd0267d9f93b5).
 248 In fact this pattern is trivial to generalize to any type of contraction when
 249 targeting the `vector.contract` op, as well as to any field (`+/*`, `min/+`,
 250 `max/+`, `or/and`, `logsumexp/+` ...) . In other words, by operating on a higher
 251 level of generic abstractions than affine loops, non-trivial transformations
 252 become significantly simpler and composable at a finer granularity.
 253
 254 Irrespective of the existence of an auto-vectorizer, one can build a notional
 255 vector language based on the VectorOps dialect and build end-to-end models with
 256 expressing `vector`s in the IR directly and simple pattern-rewrites.
 257 [EDSC](https://github.com/llvm/llvm-project/blob/main/mlir/docs/EDSC.md)s
 258 provide a simple way of driving such a notional language directly in C++.
 259
 260 ## Bikeshed Naming Discussion
 261
 262 There are arguments against naming an n-D level of abstraction `vector` because
 263 most people associate it with 1-D `vector`s. On the other hand, `vector`s are
 264 first-class n-D values in MLIR. The alternative name Tile has been proposed,
 265 which conveys higher-D meaning. But it also is one of the most overloaded terms
 266 in compilers and hardware. For now, we generally use the `n-D` `vector` name and
 267 are open to better suggestions.
 268
 269 ## 0D Vectors
 270
 271 Vectors of dimension 0 (or _0-D vectors_ or _0D vectors_) are allowed inside
 272 MLIR. For instance, a `f32` vector containing one scalar can be denoted as
 273 `vector<f32>`. This is similar to the `tensor<f32>` type that is available in
 274 TensorFlow or the `memref<f32>` type that is available in MLIR.
 275
 276 Generally, a 0D `vector` can be interpreted as a scalar. The benefit of 0D
 277 `vector`s, `tensor`s, and `memref`s is that they make it easier to lower code
 278 from various frontends such as TensorFlow and make it easier to handle corner
 279 cases such as unrolling a loop from 1D to 0D.
 280
 281 ## LLVM Lowering Tradeoffs
 282
 283 This section describes the tradeoffs involved in lowering the MLIR n-D vector
 284 type and operations on it to LLVM-IR. Putting aside the
 285 [LLVM Matrix](http://lists.llvm.org/pipermail/llvm-dev/2018-October/126871.html)
 286 proposal for now, this assumes LLVM only has built-in support for 1-D vector.
 287 The relationship with the LLVM Matrix proposal is discussed at the end of this
 288 document.
 289
 290 LLVM instructions are prefixed by the `llvm.` dialect prefix (e.g.
 291 `llvm.insertvalue`). Such ops operate exclusively on 1-D vectors and aggregates
 292 following the [LLVM LangRef](https://llvm.org/docs/LangRef.html). MLIR
 293 operations are prefixed by the `vector.` dialect prefix (e.g.
 294 `vector.insertelement`). Such ops operate exclusively on MLIR `n-D` `vector`
 295 types.
 296
 297 ### Alternatives For Lowering an n-D Vector Type to LLVM
 298
 299 Consider a vector of rank n with static sizes `{s_0, ... s_{n-1}}` (i.e. an MLIR
 300 `vector<s_0x...s_{n-1}xf32>`). Lowering such an `n-D` MLIR vector type to an
 301 LLVM descriptor can be done by either:
 302
 303 1.  Nested aggregate type of `1-D` vector:
 304     `!llvm."[s_0x[s_1x[...<s_{n-1}xf32>]]]">` in the MLIR LLVM dialect (current
 305     lowering in MLIR).
 306 2.  Flattening to a `1-D` vector: `!llvm<"(s_0*...*s_{n-1})xfloat">` in the MLIR
 307     LLVM dialect.
 308 3.  A mix of both.
 309
 310 There are multiple tradeoffs involved in choosing one or the other that we
 311 discuss. It is important to note that “a mix of both” immediately reduces to
 312 “nested aggregate type of 1-D vector” with a `vector.cast %0:
 313 vector<4x8x16x32xf32> to vector<4x4096xf32>` operation, that flattens the most
 314 "k" minor dimensions.
 315
 316 ### Constraints Inherited from LLVM (see LangRef)
 317
 318 The first constraint was already mentioned: LLVM only supports `1-D` `vector`
 319 types natively. Additional constraints are related to the difference in LLVM
 320 between vector and
 321 [aggregate types](https://llvm.org/docs/LangRef.html#aggregate-types):
 322 > Aggregate Types are a subset of derived types that can contain multiple
 323 > member types. Arrays and structs are aggregate types. Vectors are not
 324 > considered to be aggregate types.
 325
 326 This distinction is also reflected in some of the operations. For `1-D` vectors,
 327 the operations `llvm.extractelement`, `llvm.insertelement`, and
 328 `llvm.shufflevector` apply, with direct support for dynamic indices. For `n-D`
 329 vectors with `n>1`, and thus aggregate types at LLVM level, the more restrictive
 330 operations `llvm.extractvalue` and `llvm.insertvalue` apply, which only accept
 331 static indices. There is no direct shuffling support for aggregate types.
 332
 333 The next sentence (cf. LangRef [structure
 334 type](https://llvm.org/docs/LangRef.html#structure-type)) illustrates a
 335 recurrent tradeoff, also found in MLIR, between
 336 “value types” (subject to SSA use-def chains) and “memory types” (subject to
 337 aliasing and side-effects):
 338 > Structures in memory are accessed using ‘load’ and ‘store’ by getting a
 339 > pointer to a field with the llvm.getelementptr instruction. Structures in
 340 > registers are accessed using the llvm.extractvalue and llvm.insertvalue
 341 > instructions.
 342
 343 When transposing this to MLIR, `llvm.getelementptr` works on pointers to `n-D`
 344 vectors in memory. For `n-D`, vectors values that live in registers we can use
 345 `vector.extract` and `vector.insert` which do not accept dynamic indices. Note
 346 that this is consistent with hardware considerations as discussed below.
 347
 348 An alternative is to use an LLVM `1-D` `vector` type for which one can use
 349 `llvm.extractelement`, `llvm.insertelement` and `llvm.shufflevector`. These
 350 operations accept dynamic indices. The implication is that one has to use a
 351 flattened lowering of an MLIR n-D vector to an LLVM 1-D vector.
 352
 353 There are multiple tradeoffs involved that mix implications on the programming
 354 model, execution on actual HW and what is visible or hidden from codegen. They
 355 are discussed in the following sections.
 356
 357 ### Nested Aggregate
 358
 359 Pros:
 360
 361 1.  Natural encoding n-D vector -> (n-1)-D aggregate over 1-D vector.
 362 2.  No need for linearization / delinearization logic inserted everywhere.
 363 3.  `llvm.insertvalue`, `llvm.extractvalue` of `(n-k)-D` aggregate is natural.
 364 4.  `llvm.insertelement`, `llvm.extractelement`, `llvm.shufflevector` over `1-D`
 365     vector type is natural.
 366
 367 Cons:
 368
 369 1.  `llvm.insertvalue` / `llvm.extractvalue` does not accept dynamic indices but
 370     only static ones.
 371 2.  Dynamic indexing on the non-most-minor dimension requires roundtrips to
 372     memory.
 373 3.  Special intrinsics and native instructions in LLVM operate on `1-D` vectors.
 374     This is not expected to be a practical limitation thanks to a `vector.cast
 375     %0: vector<4x8x16x32xf32> to vector<4x4096xf32>` operation, that flattens
 376     the most minor dimensions (see the bigger picture in implications on
 377     codegen).
 378
 379 ### Flattened 1-D Vector Type
 380
 381 Pros:
 382
 383 1.  `insertelement` / `extractelement` / `shufflevector` with dynamic indexing
 384     is possible over the whole lowered `n-D` vector type.
 385 2.  Supports special intrinsics and native operations.
 386
 387 Cons:
 388
 389 1.  Requires linearization/delinearization logic everywhere, translations are
 390     complex.
 391 2.  Hides away the real HW structure behind dynamic indexing: at the end of the
 392     day, HW vector sizes are generally fixed and multiple vectors will be needed
 393     to hold a vector that is larger than the HW.
 394 3.  Unlikely peephole optimizations will result in good code: arbitrary dynamic
 395     accesses, especially at HW vector boundaries unlikely to result in regular
 396     patterns.
 397
 398 ### Discussion
 399
 400 #### HW Vectors and Implications on the SW and the Programming Model
 401
 402 As of today, the LLVM model only support `1-D` vector types. This is
 403 unsurprising because historically, the vast majority of HW only supports `1-D`
 404 vector registers. We note that multiple HW vendors are in the process of
 405 evolving to higher-dimensional physical vectors.
 406
 407 In the following discussion, let's assume the HW vector size is `1-D` and the SW
 408 vector size is `n-D`, with `n >= 1`. The same discussion would apply with `2-D`
 409 HW `vector` size and `n >= 2`. In this context, most HW exhibit a vector
 410 register file. The number of such vectors is fixed. Depending on the rank and
 411 sizes of the SW vector abstraction and the HW vector sizes and number of
 412 registers, an `n-D` SW vector type may be materialized by a mix of multiple
 413 `1-D` HW vector registers + memory locations at a given point in time.
 414
 415 The implication of the physical HW constraints on the programming model are that
 416 one cannot index dynamically across hardware registers: a register file can
 417 generally not be indexed dynamically. This is because the register number is
 418 fixed and one either needs to unroll explicitly to obtain fixed register numbers
 419 or go through memory. This is a constraint familiar to CUDA programmers: when
 420 declaring a `private float a[4]`; and subsequently indexing with a *dynamic*
 421 value results in so-called **local memory** usage (i.e. roundtripping to
 422 memory).
 423
 424 #### Implication on codegen
 425
 426 MLIR `n-D` vector types are currently represented as `(n-1)-D` arrays of `1-D`
 427 vectors when lowered to LLVM. This introduces the consequences on static vs
 428 dynamic indexing discussed previously: `extractelement`, `insertelement` and
 429 `shufflevector` on `n-D` vectors in MLIR only support static indices. Dynamic
 430 indices are only supported on the most minor `1-D` vector but not the outer
 431 `(n-1)-D`. For other cases, explicit load / stores are required.
 432
 433 The implications on codegen are as follows:
 434
 435 1.  Loops around `vector` values are indirect addressing of vector values, they
 436     must operate on explicit load / store operations over `n-D` vector types.
 437 2.  Once an `n-D` `vector` type is loaded into an SSA value (that may or may not
 438     live in `n` registers, with or without spilling, when eventually lowered),
 439     it may be unrolled to smaller `k-D` `vector` types and operations that
 440     correspond to the HW. This level of MLIR codegen is related to register
 441     allocation and spilling that occur much later in the LLVM pipeline.
 442 3.  HW may support >1-D vectors with intrinsics for indirect addressing within
 443     these vectors. These can be targeted thanks to explicit `vector_cast`
 444     operations from MLIR `k-D` vector types and operations to LLVM `1-D`
 445     vectors + intrinsics.
 446
 447 Alternatively, we argue that directly lowering to a linearized abstraction hides
 448 away the codegen complexities related to memory accesses by giving a false
 449 impression of magical dynamic indexing across registers. Instead we prefer to
 450 make those very explicit in MLIR and allow codegen to explore tradeoffs.
 451 Different HW will require different tradeoffs in the sizes involved in steps 1.,
 452 2. and 3.
 453
 454 Decisions made at the MLIR level will have implications at a much later stage in
 455 LLVM (after register allocation). We do not envision to expose concerns related
 456 to modeling of register allocation and spilling to MLIR explicitly. Instead,
 457 each target will expose a set of "good" target operations and `n-D` vector
 458 types, associated with costs that `PatterRewriters` at the MLIR level will be
 459 able to target. Such costs at the MLIR level will be abstract and used for
 460 ranking, not for accurate performance modeling. In the future such costs will be
 461 learned.
 462
 463 #### Implication on Lowering to Accelerators
 464
 465 To target accelerators that support higher dimensional vectors natively, we can
 466 start from either `1-D` or `n-D` vectors in MLIR and use `vector.cast` to
 467 flatten the most minor dimensions to `1-D` `vector<Kxf32>` where `K` is an
 468 appropriate constant. Then, the existing lowering to LLVM-IR immediately
 469 applies, with extensions for accelerator-specific intrinsics.
 470
 471 It is the role of an Accelerator-specific vector dialect (see codegen flow in
 472 the figure above) to lower the `vector.cast`. Accelerator -> LLVM lowering would
 473 then consist of a bunch of `Accelerator -> Accelerator` rewrites to perform the
 474 casts composed with `Accelerator -> LLVM` conversions + intrinsics that operate
 475 on `1-D` `vector<Kxf32>`.
 476
 477 Some of those rewrites may need extra handling, especially if a reduction is
 478 involved. For example, `vector.cast %0: vector<K1x...xKnxf32> to vector<Kxf32>`
 479 when `K != K1 * … * Kn` and some arbitrary irregular `vector.cast %0:
 480 vector<4x4x17xf32> to vector<Kxf32>` may introduce masking and intra-vector
 481 shuffling that may not be worthwhile or even feasible, i.e. infinite cost.
 482
 483 However `vector.cast %0: vector<K1x...xKnxf32> to vector<Kxf32>` when `K = K1 *
 484 … * Kn` should be close to a noop.
 485
 486 As we start building accelerator-specific abstractions, we hope to achieve
 487 retargetable codegen: the same infra is used for CPU, GPU and accelerators with
 488 extra MLIR patterns and costs.
 489
 490 #### Implication on calling external functions that operate on vectors
 491
 492 It is possible (likely) that we additionally need to linearize when calling an
 493 external function.
 494
 495 ### Relationship to LLVM matrix type proposal.
 496
 497 The LLVM matrix proposal was formulated 1 year ago but seemed to be somewhat
 498 stalled until recently. In its current form, it is limited to 2-D matrix types
 499 and operations are implemented with LLVM intrinsics. In contrast, MLIR sits at a
 500 higher level of abstraction and allows the lowering of generic operations on
 501 generic n-D vector types from MLIR to aggregates of 1-D LLVM vectors. In the
 502 future, it could make sense to lower to the LLVM matrix abstraction also for CPU
 503 even though MLIR will continue needing higher level abstractions.
 504
 505 On the other hand, one should note that as MLIR is moving to LLVM, this document
 506 could become the unifying abstraction that people should target for 1-D vectors
 507 and the LLVM matrix proposal can be viewed as a subset of this work.
 508
 509 ### Conclusion
 510
 511 The flattened 1-D vector design in the LLVM matrix proposal is good in a
 512 HW-specific world with special intrinsics. This is a good abstraction for
 513 register allocation, Instruction-Level-Parallelism and
 514 SoftWare-Pipelining/Modulo Scheduling optimizations at the register level.
 515 However MLIR codegen operates at a higher level of abstraction where we want to
 516 target operations on coarser-grained vectors than the HW size and on which
 517 unroll-and-jam is applied and patterns across multiple HW vectors can be
 518 matched.
 519
 520 This makes “nested aggregate type of 1-D vector” an appealing abstraction for
 521 lowering from MLIR because:
 522
 523 1.  it does not hide complexity related to the buffer vs value semantics and the
 524     memory subsystem and
 525 2.  it does not rely on LLVM to magically make all the things work from a too
 526     low-level abstraction.
 527
 528 The use of special intrinsics in a `1-D` LLVM world is still available thanks to
 529 an explicit `vector.cast` op.
 530
 531 ## Operations
 532
 533 [include "Dialects/VectorOps.md"]