mlir/docs/BytecodeFormat.md

   1 # MLIR Bytecode Format
   2
   3 This document describes the MLIR bytecode format and its encoding.
   4 This format is versioned and stable: we don't plan to ever break
   5 compatibility, that is a dialect should be able to deserialize any
   6 older bytecode. Similarly, we support back-deployment so that an
   7 older version of the format can be targetted.
   8
   9 That said, it is important to realize that the promises of the
  10 bytecode format are made assuming immutable dialects: the format
  11 allows backward and forward compatibility, but only when nothing
  12 in a dialect changes (operations, types, attributes definitions).
  13
  14 A dialect can opt-in to handle its own versioning through the
  15 `BytecodeDialectInterface`. Some hooks are exposed to the dialect
  16 to allow managing a version encoded into the bytecode file. The
  17 version is loaded lazily and allows to retrieve the version
  18 information while decoding the input IR, and gives an opportunity
  19 to each dialect for which a version is present to perform IR
  20 upgrades post-parsing through the `upgradeFromVersion` method.
  21 There is no restriction on what kind of information a dialect
  22 is allowed to encode to model its versioning.
  23
  24 [TOC]
  25
  26 ## Magic Number
  27
  28 MLIR uses the following four-byte magic number to
  29 indicate bytecode files:
  30
  31 '\[‘M’<sub>8</sub>, ‘L’<sub>8</sub>, ‘ï’<sub>8</sub>, ‘R’<sub>8</sub>\]'
  32
  33 In hex:
  34
  35 '\[‘4D’<sub>8</sub>, ‘4C’<sub>8</sub>, ‘EF’<sub>8</sub>, ‘52’<sub>8</sub>\]'
  36
  37 ## Format Overview
  38
  39 An MLIR Bytecode file is comprised of a byte stream, with a few simple
  40 structural concepts layered on top.
  41
  42 ### Primitives
  43
  44 #### Fixed-Width Integers
  45
  46 ```
  47   byte ::= `0x00`...`0xFF`
  48 ```
  49
  50 Fixed width integers are unsigned integers of a known byte size. The values are
  51 stored in little-endian byte order.
  52
  53 TODO: Add larger fixed width integers as necessary.
  54
  55 #### Variable-Width Integers
  56
  57 Variable width integers, or `VarInt`s, provide a compact representation for
  58 integers. Each encoded VarInt consists of one to nine bytes, which together
  59 represent a single 64-bit value. The MLIR bytecode utilizes the "PrefixVarInt"
  60 encoding for VarInts. This encoding is a variant of the
  61 [LEB128 ("Little-Endian Base 128")](https://en.wikipedia.org/wiki/LEB128)
  62 encoding, where each byte of the encoding provides up to 7 bits for the value,
  63 with the remaining bit used to store a tag indicating the number of bytes used
  64 for the encoding. This means that small unsigned integers (less than 2^7) may be
  65 stored in one byte, unsigned integers up to 2^14 may be stored in two bytes,
  66 etc.
  67
  68 The first byte of the encoding includes a length prefix in the low bits. This
  69 prefix is a bit sequence of '0's followed by a terminal '1', or the end of the
  70 byte. The number of '0' bits indicate the number of _additional_ bytes, not
  71 including the prefix byte, used to encode the value. All of the remaining bits
  72 in the first byte, along with all of the bits in the additional bytes, provide
  73 the value of the integer. Below are the various possible encodings of the prefix
  74 byte:
  75
  76 ```
  77 xxxxxxx1:  7 value bits, the encoding uses 1 byte
  78 xxxxxx10: 14 value bits, the encoding uses 2 bytes
  79 xxxxx100: 21 value bits, the encoding uses 3 bytes
  80 xxxx1000: 28 value bits, the encoding uses 4 bytes
  81 xxx10000: 35 value bits, the encoding uses 5 bytes
  82 xx100000: 42 value bits, the encoding uses 6 bytes
  83 x1000000: 49 value bits, the encoding uses 7 bytes
  84 10000000: 56 value bits, the encoding uses 8 bytes
  85 00000000: 64 value bits, the encoding uses 9 bytes
  86 ```
  87
  88 ##### Signed Variable-Width Integers
  89
  90 Signed variable width integer values are encoded in a similar fashion to
  91 [varints](#variable-width-integers), but employ
  92 [zigzag encoding](https://en.wikipedia.org/wiki/Variable-length_quantity#Zigzag_encoding).
  93 This encoding uses the low bit of the value to indicate the sign, which allows
  94 for more efficiently encoding negative numbers. If a negative value were encoded
  95 using a normal [varint](#variable-width-integers), it would be treated as an
  96 extremely large unsigned value. Using zigzag encoding allows for a smaller
  97 number of active bits in the value, leading to a smaller encoding. Below is the
  98 basic computation for generating a zigzag encoding:
  99
 100 ```
 101 (value << 1) ^ (value >> 63)
 102 ```
 103
 104 #### Strings
 105
 106 Strings are blobs of characters with an associated length.
 107
 108 ### Sections
 109
 110 ```
 111 section {
 112   idAndIsAligned: byte // id | (hasAlign << 7)
 113   length: varint,
 114
 115   alignment: varint?,
 116   padding: byte[], // Padding bytes are always `0xCB`.
 117
 118   data: byte[]
 119 }
 120 ```
 121
 122 Sections are a mechanism for grouping data within the bytecode. They enable
 123 delayed processing, which is useful for out-of-order processing of data,
 124 lazy-loading, and more. Each section contains a Section ID, whose high bit
 125 indicates if the section has alignment requirements, a length (which allows for
 126 skipping over the section), and an optional alignment. When an alignment is
 127 present, a variable number of padding bytes (0xCB) may appear before the section
 128 data. The alignment of a section must be a power of 2.
 129
 130 ## MLIR Encoding
 131
 132 Given the generic structure of MLIR, the bytecode encoding is actually fairly
 133 simplistic. It effectively maps to the core components of MLIR.
 134
 135 ### Top Level Structure
 136
 137 The top-level structure of the bytecode contains the 4-byte "magic number", a
 138 version number, a null-terminated producer string, and a list of sections. Each
 139 section is currently only expected to appear once within a bytecode file.
 140
 141 ```
 142 bytecode {
 143   magic: "MLïR",
 144   version: varint,
 145   producer: string,
 146   sections: section[]
 147 }
 148 ```
 149
 150 ### String Section
 151
 152 ```
 153 strings {
 154   numStrings: varint,
 155   reverseStringLengths: varint[],
 156   stringData: byte[]
 157 }
 158 ```
 159
 160 The string section contains a table of strings referenced within the bytecode,
 161 more easily enabling string sharing. This section is encoded first with the
 162 total number of strings, followed by the sizes of each of the individual strings
 163 in reverse order. The remaining encoding contains a single blob containing all
 164 of the strings concatenated together.
 165
 166 ### Dialect Section
 167
 168 The dialect section of the bytecode contains all of the dialects referenced
 169 within the encoded IR, and some information about the components of those
 170 dialects that were also referenced.
 171
 172 ```
 173 dialect_section {
 174   numDialects: varint,
 175   dialectNames: dialect_name_group[],
 176   opNames: dialect_ops_group[]  // ops grouped by dialect
 177 }
 178
 179 dialect_name_group {
 180   nameAndIsVersioned: varint  // (dialectID << 1) | (hasVersion),
 181   version: dialect_version_section  // only if versioned
 182 }
 183
 184 dialect_version_section {
 185   size: varint,
 186   version: byte[]
 187 }
 188
 189 dialect_ops_group {
 190   dialect: varint,
 191   numOpNames: varint,
 192   opNames: op_name_group[]
 193 }
 194
 195 op_name_group {
 196   nameAndIsRegistered: varint  // (nameID << 1) | (isRegisteredOp)
 197 }
 198 ```
 199
 200 Dialects are encoded as a `varint` containing the index to the name string
 201 within the string section, plus a flag indicating whether the dialect is
 202 versioned. Operation names are encoded in groups by dialect, with each group
 203 containing the dialect, the number of operation names, and the array of indexes
 204 to each name within the string section. The version is encoded as a nested
 205 section for each dialect.
 206
 207 ### Attribute/Type Sections
 208
 209 Attributes and types are encoded using two [sections](#sections), one section
 210 (`attr_type_section`) containing the actual encoded representation, and another
 211 section (`attr_type_offset_section`) containing the offsets of each encoded
 212 attribute/type into the previous section. This structure allows for attributes
 213 and types to always be lazily loaded on demand.
 214
 215 ```
 216 attr_type_section {
 217   attrs: attribute[],
 218   types: type[]
 219 }
 220 attr_type_offset_section {
 221   numAttrs: varint,
 222   numTypes: varint,
 223   offsets: attr_type_offset_group[]
 224 }
 225
 226 attr_type_offset_group {
 227   dialect: varint,
 228   numElements: varint,
 229   offsets: varint[] // (offset << 1) | (hasCustomEncoding)
 230 }
 231
 232 attribute {
 233   encoding: ...
 234 }
 235 type {
 236   encoding: ...
 237 }
 238 ```
 239
 240 Each `offset` in the `attr_type_offset_section` above is the size of the
 241 encoding for the attribute or type and a flag indicating if the encoding uses
 242 the textual assembly format, or a custom bytecode encoding. We avoid using the
 243 direct offset into the `attr_type_section`, as a smaller relative offsets
 244 provides more effective compression. Attributes and types are grouped by
 245 dialect, with each `attr_type_offset_group` in the offset section containing the
 246 corresponding parent dialect, number of elements, and offsets for each element
 247 within the group.
 248
 249 #### Attribute/Type Encodings
 250
 251 In the abstract, an attribute/type is encoded in one of two possible ways: via
 252 its assembly format, or via a custom dialect defined encoding.
 253
 254 ##### Assembly Format Fallback
 255
 256 In the case where a dialect does not define a method for encoding the attribute
 257 or type, the textual assembly format of that attribute or type is used as a
 258 fallback. For example, a type `!bytecode.type<42>` would be encoded as the null
 259 terminated string "!bytecode.type<42>". This ensures that every attribute and
 260 type can be encoded, even if the owning dialect has not yet opted in to a more
 261 efficient serialization.
 262
 263 TODO: We shouldn't redundantly encode the dialect name here, we should use a
 264 reference to the parent dialect instead.
 265
 266 ##### Dialect Defined Encoding
 267
 268 As an alternative to the assembly format fallback, dialects may also provide a
 269 custom encoding for their attributes and types. Custom encodings are very
 270 beneficial in that they are significantly smaller and faster to read and write.
 271
 272 Dialects can opt-in to providing custom encodings by implementing the
 273 `BytecodeDialectInterface`. This interface provides hooks, namely
 274 `readAttribute`/`readType` and `writeAttribute`/`writeType`, that will be used
 275 by the bytecode reader and writer. These hooks are provided a reader and writer
 276 implementation that can be used to encode various constructs in the underlying
 277 bytecode format. A unique feature of this interface is that dialects may choose
 278 to only encode a subset of their attributes and types in a custom bytecode
 279 format, which can simplify adding new or experimental components that aren't
 280 fully baked.
 281
 282 When implementing the bytecode interface, dialects are responsible for all
 283 aspects of the encoding. This includes the indicator for which kind of attribute
 284 or type is being encoded; the bytecode reader will only know that it has
 285 encountered an attribute or type of a given dialect, it doesn't encode any
 286 further information. As such, a common encoding idiom is to use a leading
 287 `varint` code to indicate how the attribute or type was encoded.
 288
 289 ### Resource Section
 290
 291 Resources are encoded using two [sections](#sections), one section
 292 (`resource_section`) containing the actual encoded representation, and another
 293 section (`resource_offset_section`) containing the offsets of each encoded
 294 resource into the previous section.
 295
 296 ```
 297 resource_section {
 298   resources: resource[]
 299 }
 300 resource {
 301   value: resource_bool | resource_string | resource_blob
 302 }
 303 resource_bool {
 304   value: byte
 305 }
 306 resource_string {
 307   value: varint
 308 }
 309 resource_blob {
 310   alignment: varint,
 311   size: varint,
 312   padding: byte[],
 313   blob: byte[]
 314 }
 315
 316 resource_offset_section {
 317   numExternalResourceGroups: varint,
 318   resourceGroups: resource_group[]
 319 }
 320 resource_group {
 321   key: varint,
 322   numResources: varint,
 323   resources: resource_info[]
 324 }
 325 resource_info {
 326   key: varint,
 327   size: varint
 328   kind: byte,
 329 }
 330 ```
 331
 332 Resources are grouped by the provider, either an external entity or a dialect,
 333 with each `resource_group` in the offset section containing the corresponding
 334 provider, number of elements, and info for each element within the group. For
 335 each element, we record the key, the value kind, and the encoded size. We avoid
 336 using the direct offset into the `resource_section`, as a smaller relative
 337 offsets provides more effective compression.
 338
 339 ### IR Section
 340
 341 The IR section contains the encoded form of operations within the bytecode.
 342
 343 ```
 344 ir_section {
 345   block: block; // Single block without arguments.
 346 }
 347 ```
 348
 349 #### Operation Encoding
 350
 351 ```
 352 op {
 353   name: varint,
 354   encodingMask: byte,
 355   location: varint,
 356
 357   attrDict: varint?,
 358
 359   numResults: varint?,
 360   resultTypes: varint[],
 361
 362   numOperands: varint?,
 363   operands: varint[],
 364
 365   numSuccessors: varint?,
 366   successors: varint[],
 367
 368   numUseListOrders: varint?,
 369   useListOrders: uselist[],
 370
 371   regionEncoding: varint?, // (numRegions << 1) | (isIsolatedFromAbove)
 372
 373   // regions are stored in a section if isIsolatedFromAbove
 374   regions: (region | region_section)[]
 375 }
 376
 377 uselist {
 378   indexInRange: varint?,
 379   useListEncoding: varint, // (numIndices << 1) | (isIndexPairEncoding)
 380   indices: varint[]
 381 }
 382 ```
 383
 384 The encoding of an operation is important because this is generally the most
 385 commonly appearing structure in the bytecode. A single encoding is used for
 386 every type of operation. Given this prevalence, many of the fields of an
 387 operation are optional. The `encodingMask` field is a bitmask which indicates
 388 which of the components of the operation are present.
 389
 390 ##### Location
 391
 392 The location is encoded as the index of the location within the attribute table.
 393
 394 ##### Attributes
 395
 396 If the operation has attribues, the index of the operation attribute dictionary
 397 within the attribute table is encoded.
 398
 399 ##### Results
 400
 401 If the operation has results, the number of results and the indexes of the
 402 result types within the type table are encoded.
 403
 404 ##### Operands
 405
 406 If the operation has operands, the number of operands and the value index of
 407 each operand is encoded. This value index is the relative ordering of the
 408 definition of that value from the start of the first ancestor isolated region.
 409
 410 ##### Successors
 411
 412 If the operation has successors, the number of successors and the indexes of the
 413 successor blocks within the parent region are encoded.
 414
 415 ##### Use-list orders
 416
 417 The reference use-list order is assumed to be the reverse of the global
 418 enumeration of all the op operands that one would obtain with a pre-order walk
 419 of the IR. This order is naturally obtained by building blocks of operations
 420 op-by-op. However, some transformations may shuffle the use-lists with respect
 421 to this reference ordering. If any of the results of the operation have a
 422 use-list order that is not sorted with respect to the reference use-list order,
 423 an encoding is emitted such that it is possible to reconstruct such order after
 424 parsing the bytecode. The encoding represents an index map from the reference
 425 operand order to the current use-list order. A bit flag is used to detect if
 426 this encoding is of type index-pair or not. When the bit flag is set to zero,
 427 the element at `i` represent the position of the use `i` of the reference list
 428 into the current use-list. When the bit flag is set to `1`, the encoding
 429 represent index pairs `(i, j)`, which indicate that the use at position `i` of
 430 the reference list is mapped to position `j` in the current use-list. When only
 431 less than half of the elements in the current use-list are shuffled with respect
 432 to the reference use-list, the index-pair encoding is used to reduce the
 433 bytecode memory requirements.
 434
 435 ##### Regions
 436
 437 If the operation has regions, the number of regions and if the regions are
 438 isolated from above are encoded together in a single varint. Afterwards, each
 439 region is encoded inline.
 440
 441 #### Region Encoding
 442
 443 ```
 444 region {
 445   numBlocks: varint,
 446
 447   numValues: varint?,
 448   blocks: block[]
 449 }
 450 ```
 451
 452 A region is encoded first with the number of blocks within. If the region is
 453 non-empty, the number of values defined directly within the region are encoded,
 454 followed by the blocks of the region.
 455
 456 #### Block Encoding
 457
 458 ```
 459 block {
 460   encoding: varint, // (numOps << 1) | (hasBlockArgs)
 461   arguments: block_arguments?, // Optional based on encoding
 462   ops : op[]
 463 }
 464
 465 block_arguments {
 466   numArgs: varint?,
 467   args: block_argument[]
 468   numUseListOrders: varint?,
 469   useListOrders: uselist[],
 470 }
 471
 472 block_argument {
 473   typeAndLocation: varint, // (type << 1) | (hasLocation)
 474   location: varint? // Optional, else unknown location
 475 }
 476 ```
 477
 478 A block is encoded with an array of operations and block arguments. The first
 479 field is an encoding that combines the number of operations in the block, with a
 480 flag indicating if the block has arguments.
 481
 482 Use-list orders are attached to block arguments similarly to how they are
 483 attached to operation results.