mlir/docs/BytecodeFormat.md

   1 # MLIR Bytecode Format
   2
   3 This documents describes the MLIR bytecode format and its encoding.
   4 This format is versioned and stable: we don't plan to ever break
   5 compatibility, that is a dialect should be able to deserialize and
   6 older bytecode. Similarly, we support back-deployment we an older
   7 version of the format can be targetted.
   8
   9 That said, it is important to realize that the promises of the
  10 bytecode format are made assuming immutable dialects: the format
  11 allows backward and forward compatibility, but only when nothing
  12 in a dialect changes (operations, types, attributes definitions).
  13
  14 A dialect can opt-in to handle its own versioning through the
  15 `BytecodeDialectInterface`. Some hooks are exposed to the dialect
  16 to allow managing a version encoded into the bytecode file. The
  17 version is loaded lazily and allows to retrieve the version
  18 information while decoding the input IR, and gives an opportunity
  19 to each dialect for which a version is present to perform IR
  20 upgrades post-parsing through the `upgradeFromVersion` method.
  21 There is no restriction on what kind of information a dialect
  22 is allowed to encode to model its versioning
  23
  24 [TOC]
  25
  26 ## Magic Number
  27
  28 MLIR uses the following four-byte magic number to
  29 indicate bytecode files:
  30
  31 '\[‘M’<sub>8</sub>, ‘L’<sub>8</sub>, ‘ï’<sub>8</sub>, ‘R’<sub>8</sub>\]'
  32
  33 In hex:
  34
  35 '\[‘4D’<sub>8</sub>, ‘4C’<sub>8</sub>, ‘EF’<sub>8</sub>, ‘52’<sub>8</sub>\]'
  36
  37 ## Format Overview
  38
  39 An MLIR Bytecode file is comprised of a byte stream, with a few simple
  40 structural concepts layered on top.
  41
  42 ### Primitives
  43
  44 #### Fixed-Width Integers
  45
  46 ```
  47   byte ::= `0x00`...`0xFF`
  48 ```
  49
  50 Fixed width integers are unsigned integers of a known byte size. The values are
  51 stored in little-endian byte order.
  52
  53 TODO: Add larger fixed width integers as necessary.
  54
  55 #### Variable-Width Integers
  56
  57 Variable width integers, or `VarInt`s, provide a compact representation for
  58 integers. Each encoded VarInt consists of one to nine bytes, which together
  59 represent a single 64-bit value. The MLIR bytecode utilizes the "PrefixVarInt"
  60 encoding for VarInts. This encoding is a variant of the
  61 [LEB128 ("Little-Endian Base 128")](https://en.wikipedia.org/wiki/LEB128)
  62 encoding, where each byte of the encoding provides up to 7 bits for the value,
  63 with the remaining bit used to store a tag indicating the number of bytes used
  64 for the encoding. This means that small unsigned integers (less than 2^7) may be
  65 stored in one byte, unsigned integers up to 2^14 may be stored in two bytes,
  66 etc.
  67
  68 The first byte of the encoding includes a length prefix in the low bits. This
  69 prefix is a bit sequence of '0's followed by a terminal '1', or the end of the
  70 byte. The number of '0' bits indicate the number of _additional_ bytes, not
  71 including the prefix byte, used to encode the value. All of the remaining bits
  72 in the first byte, along with all of the bits in the additional bytes, provide
  73 the value of the integer. Below are the various possible encodings of the prefix
  74 byte:
  75
  76 ```
  77 xxxxxxx1:  7 value bits, the encoding uses 1 byte
  78 xxxxxx10: 14 value bits, the encoding uses 2 bytes
  79 xxxxx100: 21 value bits, the encoding uses 3 bytes
  80 xxxx1000: 28 value bits, the encoding uses 4 bytes
  81 xxx10000: 35 value bits, the encoding uses 5 bytes
  82 xx100000: 42 value bits, the encoding uses 6 bytes
  83 x1000000: 49 value bits, the encoding uses 7 bytes
  84 10000000: 56 value bits, the encoding uses 8 bytes
  85 00000000: 64 value bits, the encoding uses 9 bytes
  86 ```
  87
  88 ##### Signed Variable-Width Integers
  89
  90 Signed variable width integer values are encoded in a similar fashion to
  91 [varints](#variable-width-integers), but employ
  92 [zigzag encoding](https://en.wikipedia.org/wiki/Variable-length_quantity#Zigzag_encoding).
  93 This encoding uses the low bit of the value to indicate the sign, which allows
  94 for more efficiently encoding negative numbers. If a negative value were encoded
  95 using a normal [varint](#variable-width-integers), it would be treated as an
  96 extremely large unsigned value. Using zigzag encoding allows for a smaller
  97 number of active bits in the value, leading to a smaller encoding. Below is the
  98 basic computation for generating a zigzag encoding:
  99
 100 ```
 101 (value << 1) ^ (value >> 63)
 102 ```
 103
 104 #### Strings
 105
 106 Strings are blobs of characters with an associated length.
 107
 108 ### Sections
 109
 110 ```
 111 section {
 112   idAndIsAligned: byte // id | (hasAlign << 7)
 113   length: varint,
 114
 115   alignment: varint?,
 116   padding: byte[], // Padding bytes are always `0xCB`.
 117
 118   data: byte[]
 119 }
 120 ```
 121
 122 Sections are a mechanism for grouping data within the bytecode. They enable
 123 delayed processing, which is useful for out-of-order processing of data,
 124 lazy-loading, and more. Each section contains a Section ID, whose high bit
 125 indicates if the section has alignment requirements, a length (which allows for
 126 skipping over the section), and an optional alignment. When an alignment is
 127 present, a variable number of padding bytes (0xCB) may appear before the section
 128 data. The alignment of a section must be a power of 2.
 129
 130 ## MLIR Encoding
 131
 132 Given the generic structure of MLIR, the bytecode encoding is actually fairly
 133 simplistic. It effectively maps to the core components of MLIR.
 134
 135 ### Top Level Structure
 136
 137 The top-level structure of the bytecode contains the 4-byte "magic number", a
 138 version number, a null-terminated producer string, and a list of sections. Each
 139 section is currently only expected to appear once within a bytecode file.
 140
 141 ```
 142 bytecode {
 143   magic: "MLïR",
 144   version: varint,
 145   producer: string,
 146   sections: section[]
 147 }
 148 ```
 149
 150 ### String Section
 151
 152 ```
 153 strings {
 154   numStrings: varint,
 155   reverseStringLengths: varint[],
 156   stringData: byte[]
 157 }
 158 ```
 159
 160 The string section contains a table of strings referenced within the bytecode,
 161 more easily enabling string sharing. This section is encoded first with the
 162 total number of strings, followed by the sizes of each of the individual strings
 163 in reverse order. The remaining encoding contains a single blob containing all
 164 of the strings concatenated together.
 165
 166 ### Dialect Section
 167
 168 The dialect section of the bytecode contains all of the dialects referenced
 169 within the encoded IR, and some information about the components of those
 170 dialects that were also referenced.
 171
 172 ```
 173 dialect_section {
 174   numDialects: varint,
 175   dialectNames: varint[],
 176   numTotalOpNames: varint,
 177   opNames: op_name_group[]
 178 }
 179
 180 op_name_group {
 181   dialect: varint // (dialectID << 1) | (hasVersion),
 182   version : dialect_version_section
 183   numOpNames: varint,
 184   opNames: varint[]
 185 }
 186
 187 dialect_version_section {
 188   size: varint,
 189   version: byte[]
 190 }
 191
 192 ```
 193
 194 Dialects are encoded as a `varint` containing the index to the name string
 195 within the string section, plus a flag indicating whether the dialect is
 196 versioned. Operation names are encoded in groups by dialect, with each group
 197 containing the dialect, the number of operation names, and the array of indexes
 198 to each name within the string section. The version is encoded as a nested
 199 section.
 200
 201 ### Attribute/Type Sections
 202
 203 Attributes and types are encoded using two [sections](#sections), one section
 204 (`attr_type_section`) containing the actual encoded representation, and another
 205 section (`attr_type_offset_section`) containing the offsets of each encoded
 206 attribute/type into the previous section. This structure allows for attributes
 207 and types to always be lazily loaded on demand.
 208
 209 ```
 210 attr_type_section {
 211   attrs: attribute[],
 212   types: type[]
 213 }
 214 attr_type_offset_section {
 215   numAttrs: varint,
 216   numTypes: varint,
 217   offsets: attr_type_offset_group[]
 218 }
 219
 220 attr_type_offset_group {
 221   dialect: varint,
 222   numElements: varint,
 223   offsets: varint[] // (offset << 1) | (hasCustomEncoding)
 224 }
 225
 226 attribute {
 227   encoding: ...
 228 }
 229 type {
 230   encoding: ...
 231 }
 232 ```
 233
 234 Each `offset` in the `attr_type_offset_section` above is the size of the
 235 encoding for the attribute or type and a flag indicating if the encoding uses
 236 the textual assembly format, or a custom bytecode encoding. We avoid using the
 237 direct offset into the `attr_type_section`, as a smaller relative offsets
 238 provides more effective compression. Attributes and types are grouped by
 239 dialect, with each `attr_type_offset_group` in the offset section containing the
 240 corresponding parent dialect, number of elements, and offsets for each element
 241 within the group.
 242
 243 #### Attribute/Type Encodings
 244
 245 In the abstract, an attribute/type is encoded in one of two possible ways: via
 246 its assembly format, or via a custom dialect defined encoding.
 247
 248 ##### Assembly Format Fallback
 249
 250 In the case where a dialect does not define a method for encoding the attribute
 251 or type, the textual assembly format of that attribute or type is used as a
 252 fallback. For example, a type of `!bytecode.type` would be encoded as the null
 253 terminated string "!bytecode.type". This ensures that every attribute and type
 254 may be encoded, even if the owning dialect has not yet opted in to a more
 255 efficient serialization.
 256
 257 TODO: We shouldn't redundantly encode the dialect name here, we should use a
 258 reference to the parent dialect instead.
 259
 260 ##### Dialect Defined Encoding
 261
 262 In addition to the assembly format fallback, dialects may also provide a custom
 263 encoding for their attributes and types. Custom encodings are very beneficial in
 264 that they are significantly smaller and faster to read and write.
 265
 266 Dialects can opt-in to providing custom encodings by implementing the
 267 `BytecodeDialectInterface`. This interface provides hooks, namely
 268 `readAttribute`/`readType` and `writeAttribute`/`writeType`, that will be used
 269 by the bytecode reader and writer. These hooks are provided a reader and writer
 270 implementation that can be used to encode various constructs in the underlying
 271 bytecode format. A unique feature of this interface is that dialects may choose
 272 to only encode a subset of their attributes and types in a custom bytecode
 273 format, which can simplify adding new or experimental components that aren't
 274 fully baked.
 275
 276 When implementing the bytecode interface, dialects are responsible for all
 277 aspects of the encoding. This includes the indicator for which kind of attribute
 278 or type is being encoded; the bytecode reader will only know that it has
 279 encountered an attribute or type of a given dialect, it doesn't encode any
 280 further information. As such, a common encoding idiom is to use a leading
 281 `varint` code to indicate how the attribute or type was encoded.
 282
 283 ### Resource Section
 284
 285 Resources are encoded using two [sections](#sections), one section
 286 (`resource_section`) containing the actual encoded representation, and another
 287 section (`resource_offset_section`) containing the offsets of each encoded
 288 resource into the previous section.
 289
 290 ```
 291 resource_section {
 292   resources: resource[]
 293 }
 294 resource {
 295   value: resource_bool | resource_string | resource_blob
 296 }
 297 resource_bool {
 298   value: byte
 299 }
 300 resource_string {
 301   value: varint
 302 }
 303 resource_blob {
 304   alignment: varint,
 305   size: varint,
 306   padding: byte[],
 307   blob: byte[]
 308 }
 309
 310 resource_offset_section {
 311   numExternalResourceGroups: varint,
 312   resourceGroups: resource_group[]
 313 }
 314 resource_group {
 315   key: varint,
 316   numResources: varint,
 317   resources: resource_info[]
 318 }
 319 resource_info {
 320   key: varint,
 321   size: varint
 322   kind: byte,
 323 }
 324 ```
 325
 326 Resources are grouped by the provider, either an external entity or a dialect,
 327 with each `resource_group` in the offset section containing the corresponding
 328 provider, number of elements, and info for each element within the group. For
 329 each element, we record the key, the value kind, and the encoded size. We avoid
 330 using the direct offset into the `resource_section`, as a smaller relative
 331 offsets provides more effective compression.
 332
 333 ### IR Section
 334
 335 The IR section contains the encoded form of operations within the bytecode.
 336
 337 ```
 338 ir_section {
 339   block: block; // Single block without arguments.
 340 }
 341 ```
 342
 343 #### Operation Encoding
 344
 345 ```
 346 op {
 347   name: varint,
 348   encodingMask: byte,
 349   location: varint,
 350
 351   attrDict: varint?,
 352
 353   numResults: varint?,
 354   resultTypes: varint[],
 355
 356   numOperands: varint?,
 357   operands: varint[],
 358
 359   numSuccessors: varint?,
 360   successors: varint[],
 361
 362   numUseListOrders: varint?,
 363   useListOrders: uselist[],
 364
 365   regionEncoding: varint?, // (numRegions << 1) | (isIsolatedFromAbove)
 366
 367   // regions are stored in a section if isIsolatedFromAbove
 368   regions: (region | region_section)[]
 369 }
 370
 371 uselist {
 372   indexInRange: varint?,
 373   useListEncoding: varint, // (numIndices << 1) | (isIndexPairEncoding)
 374   indices: varint[]
 375 }
 376 ```
 377
 378 The encoding of an operation is important because this is generally the most
 379 commonly appearing structure in the bytecode. A single encoding is used for
 380 every type of operation. Given this prevelance, many of the fields of an
 381 operation are optional. The `encodingMask` field is a bitmask which indicates
 382 which of the components of the operation are present.
 383
 384 ##### Location
 385
 386 The location is encoded as the index of the location within the attribute table.
 387
 388 ##### Attributes
 389
 390 If the operation has attribues, the index of the operation attribute dictionary
 391 within the attribute table is encoded.
 392
 393 ##### Results
 394
 395 If the operation has results, the number of results and the indexes of the
 396 result types within the type table are encoded.
 397
 398 ##### Operands
 399
 400 If the operation has operands, the number of operands and the value index of
 401 each operand is encoded. This value index is the relative ordering of the
 402 definition of that value from the start of the first ancestor isolated region.
 403
 404 ##### Successors
 405
 406 If the operation has successors, the number of successors and the indexes of the
 407 successor blocks within the parent region are encoded.
 408
 409 ##### Use-list orders
 410
 411 The reference use-list order is assumed to be the reverse of the global
 412 enumeration of all the op operands that one would obtain with a pre-order walk
 413 of the IR. This order is naturally obtained by building blocks of operations
 414 op-by-op. However, some transformations may shuffle the use-lists with respect
 415 to this reference ordering. If any of the results of the operation have a
 416 use-list order that is not sorted with respect to the reference use-list order,
 417 an encoding is emitted such that it is possible to reconstruct such order after
 418 parsing the bytecode. The encoding represents an index map from the reference
 419 operand order to the current use-list order. A bit flag is used to detect if
 420 this encoding is of type index-pair or not. When the bit flag is set to zero,
 421 the element at `i` represent the position of the use `i` of the reference list
 422 into the current use-list. When the bit flag is set to `1`, the encoding
 423 represent index pairs `(i, j)`, which indicate that the use at position `i` of
 424 the reference list is mapped to position `j` in the current use-list. When only
 425 less than half of the elements in the current use-list are shuffled with respect
 426 to the reference use-list, the index-pair encoding is used to reduce the
 427 bytecode memory requirements.
 428
 429 ##### Regions
 430
 431 If the operation has regions, the number of regions and if the regions are
 432 isolated from above are encoded together in a single varint. Afterwards, each
 433 region is encoded inline.
 434
 435 #### Region Encoding
 436
 437 ```
 438 region {
 439   numBlocks: varint,
 440
 441   numValues: varint?,
 442   blocks: block[]
 443 }
 444 ```
 445
 446 A region is encoded first with the number of blocks within. If the region is
 447 non-empty, the number of values defined directly within the region are encoded,
 448 followed by the blocks of the region.
 449
 450 #### Block Encoding
 451
 452 ```
 453 block {
 454   encoding: varint, // (numOps << 1) | (hasBlockArgs)
 455   arguments: block_arguments?, // Optional based on encoding
 456   ops : op[]
 457 }
 458
 459 block_arguments {
 460   numArgs: varint?,
 461   args: block_argument[]
 462   numUseListOrders: varint?,
 463   useListOrders: uselist[],
 464 }
 465
 466 block_argument {
 467   typeAndLocation: varint, // (type << 1) | (hasLocation)
 468   location: varint? // Optional, else unknown location
 469 }
 470 ```
 471
 472 A block is encoded with an array of operations and block arguments. The first
 473 field is an encoding that combines the number of operations in the block, with a
 474 flag indicating if the block has arguments.
 475
 476 Use-list orders are attached to block arguments similarly to how they are
 477 attached to operation results.