docs/BitCodeFormat.html

   1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   2                       "http://www.w3.org/TR/html4/strict.dtd">
   3 <html>
   4 <head>
   5   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   6   <title>LLVM Bitcode File Format</title>
   7   <link rel="stylesheet" href="llvm.css" type="text/css">
   8 </head>
   9 <body>
  10 <div class="doc_title"> LLVM Bitcode File Format </div>
  11 <ol>
  12   <li><a href="#abstract">Abstract</a></li>
  13   <li><a href="#overview">Overview</a></li>
  14   <li><a href="#bitstream">Bitstream Format</a>
  15     <ol>
  16     <li><a href="#magic">Magic Numbers</a></li>
  17     <li><a href="#primitives">Primitives</a></li>
  18     <li><a href="#abbrevid">Abbreviation IDs</a></li>
  19     <li><a href="#blocks">Blocks</a></li>
  20     <li><a href="#datarecord">Data Records</a></li>
  21     <li><a href="#abbreviations">Abbreviations</a></li>
  22     <li><a href="#stdblocks">Standard Blocks</a></li>
  23     </ol>
  24   </li>
  25   <li><a href="#wrapper">Bitcode Wrapper Format</a>
  26   </li>
  27   <li><a href="#llvmir">LLVM IR Encoding</a>
  28     <ol>
  29     <li><a href="#basics">Basics</a></li>
  30     </ol>
  31   </li>
  32 </ol>
  33 <div class="doc_author">
  34   <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a>
  35   and <a href="http://www.reverberate.org">Joshua Haberman</a>.
  36 </p>
  37 </div>
  38
  39 <!-- *********************************************************************** -->
  40 <div class="doc_section"> <a name="abstract">Abstract</a></div>
  41 <!-- *********************************************************************** -->
  42
  43 <div class="doc_text">
  44
  45 <p>This document describes the LLVM bitstream file format and the encoding of
  46 the LLVM IR into it.</p>
  47
  48 </div>
  49
  50 <!-- *********************************************************************** -->
  51 <div class="doc_section"> <a name="overview">Overview</a></div>
  52 <!-- *********************************************************************** -->
  53
  54 <div class="doc_text">
  55
  56 <p>
What is commonly known as the LLVM bitcode file format (sometimes
anachronistically called bytecode) is actually two things: a <a
href="#bitstream">bitstream container format</a>
and an <a href="#llvmir">encoding of LLVM IR</a> into the container format.</p>
  61
  62 <p>
The bitstream format is an abstract encoding of structured data, similar to
XML in some ways.  Like XML, bitstream files contain tags and nested
structures, and you can parse the file without having to understand the tags.
Unlike XML, the bitstream format is a binary encoding, and it provides a
mechanism for the file to self-describe "abbreviations", which are
effectively size optimizations for the content.</p>
  69
  70 <p>LLVM IR files may be optionally embedded into a <a
  71 href="#wrapper">wrapper</a> structure that makes it easy to embed extra data
  72 along with LLVM IR files.</p>
  73
<p>This document first describes the LLVM bitstream format, then the
wrapper format, and finally the record structure used by LLVM IR files.
  76 </p>
  77
  78 </div>
  79
  80 <!-- *********************************************************************** -->
  81 <div class="doc_section"> <a name="bitstream">Bitstream Format</a></div>
  82 <!-- *********************************************************************** -->
  83
  84 <div class="doc_text">
  85
  86 <p>
  87 The bitstream format is literally a stream of bits, with a very simple
  88 structure.  This structure consists of the following concepts:
  89 </p>
  90
  91 <ul>
  92 <li>A "<a href="#magic">magic number</a>" that identifies the contents of
  93     the stream.</li>
  94 <li>Encoding <a href="#primitives">primitives</a> like variable bit-rate
  95     integers.</li>
  96 <li><a href="#blocks">Blocks</a>, which define nested content.</li>
  97 <li><a href="#datarecord">Data Records</a>, which describe entities within the
  98     file.</li>
<li><a href="#abbreviations">Abbreviations</a>, which specify compression optimizations for the file.</li>
 100 </ul>
 101
 102 <p>Note that the <a
 103 href="CommandGuide/html/llvm-bcanalyzer.html">llvm-bcanalyzer</a> tool can be
 104 used to dump and inspect arbitrary bitstreams, which is very useful for
 105 understanding the encoding.</p>
 106
 107 </div>
 108
 109 <!-- ======================================================================= -->
 110 <div class="doc_subsection"><a name="magic">Magic Numbers</a>
 111 </div>
 112
 113 <div class="doc_text">
 114
 115 <p>The first two bytes of a bitcode file are 'BC' (0x42, 0x43).
 116 The second two bytes are an application-specific magic number.  Generic
 117 bitcode tools can look at only the first two bytes to verify the file is
 118 bitcode, while application-specific programs will want to look at all four.</p>
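<p>As a rough illustration, the check described above could be sketched in
Python as follows.  The function name <tt>classify_magic</tt> and its return
shape are hypothetical and are not part of any LLVM API.</p>

```python
def classify_magic(data: bytes):
    """Return (is_bitcode, app_magic) for the first four bytes of a file.

    Hypothetical helper, for illustration only.
    """
    # A generic tool only needs the first two bytes: 'B' (0x42), 'C' (0x43).
    if len(data) < 4 or data[0:2] != b"BC":
        return (False, None)
    # Bytes 2 and 3 hold the application-specific magic number.
    return (True, data[2:4])
```

<p>An application-specific reader would additionally compare the returned
two-byte value against the magic number it expects.</p>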
 119
 120 </div>
 121
 122 <!-- ======================================================================= -->
 123 <div class="doc_subsection"><a name="primitives">Primitives</a>
 124 </div>
 125
 126 <div class="doc_text">
 127
 128 <p>
A bitstream literally consists of a stream of bits, which are read in order
starting with the least significant bit of each byte.  The stream is made up of a
number of primitive values that encode a stream of unsigned integer values.
These integers are encoded in two ways: either as <a href="#fixedwidth">Fixed
Width Integers</a> or as <a href="#variablewidth">Variable Width
Integers</a>.
 136 </p>
 137
 138 </div>
 139
 140 <!-- _______________________________________________________________________ -->
 141 <div class="doc_subsubsection"> <a name="fixedwidth">Fixed Width Integers</a>
 142 </div>
 143
 144 <div class="doc_text">
 145
<p>Fixed-width integer values have their low bits emitted directly to the file.
   For example, a 3-bit integer value encodes 1 as 001.  Fixed-width integers
   are used when there is a well-known number of options for a field.  For
   example, boolean values are usually encoded with a 1-bit wide integer.
 150 </p>
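<p>The following Python sketch shows one way to read fixed-width fields from
a byte buffer.  It assumes that the first bit read becomes the least
significant bit of the decoded value; the class name <tt>BitReader</tt> is
hypothetical.</p>

```python
class BitReader:
    """Hypothetical LSB-first bit reader, for illustration only."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position in the stream, in bits

    def read_fixed(self, width: int) -> int:
        """Read a fixed-width unsigned integer of `width` bits."""
        value = 0
        for i in range(width):
            byte = self.data[self.pos >> 3]
            # Bits are consumed starting at the least significant bit of each byte.
            bit = (byte >> (self.pos & 7)) & 1
            value |= bit << i  # assumption: first bit read is the value's low bit
            self.pos += 1
        return value
```

<p>Reading a 3-bit field and then a 1-bit field from the single byte
<tt>0xB1</tt> with this sketch yields 1 and then 0.</p>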
 151
 152 </div>
 153
 154 <!-- _______________________________________________________________________ -->
 155 <div class="doc_subsubsection"> <a name="variablewidth">Variable Width
 156 Integers</a></div>
 157
 158 <div class="doc_text">
 159
<p>Variable-width integer (VBR) values encode values of arbitrary size,
optimizing for the case where the values are small.  In a VBR field of width N,
each N-bit chunk carries N-1 bits of data, and the high bit of the chunk says
whether another chunk follows.  Given a 4-bit VBR field, any 3-bit value
(0 through 7) is encoded directly, with the high bit set to zero.  Values that
need more than N-1 bits are emitted as a series of N-1 bit chunks, least
significant chunk first, where every chunk except the last has its high bit
set.</p>
 165
<p>For example, the value 27 (0x1B) is encoded as 1011 0011 when emitted as a
vbr4 value.  The first group of four bits indicates the value 3 (011) with a
continuation piece (indicated by a high bit of 1).  The next group indicates a
value of 24 (011 &lt;&lt; 3) with no continuation.  The sum (3+24) yields the value
27.
 171 </p>
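<p>The chunking rule can be sketched in Python as follows.  The helper names
<tt>encode_vbr</tt> and <tt>decode_vbr</tt> are hypothetical, and the sketch
works on lists of chunk values rather than on a packed bit buffer.</p>

```python
def encode_vbr(value: int, width: int) -> list:
    """Split a non-negative integer into `width`-bit VBR chunks, in emit order."""
    data_bits = width - 1
    mask = (1 << data_bits) - 1
    high_bit = 1 << data_bits
    chunks = []
    while value > mask:
        # More data follows, so set the continuation (high) bit.
        chunks.append((value & mask) | high_bit)
        value >>= data_bits
    chunks.append(value)  # last chunk: high bit clear
    return chunks


def decode_vbr(chunks, width: int) -> int:
    """Reassemble the integer from chunks produced by encode_vbr."""
    data_bits = width - 1
    mask = (1 << data_bits) - 1
    value = 0
    for i, chunk in enumerate(chunks):
        value |= (chunk & mask) << (i * data_bits)
    return value
```

<p>With this sketch, <tt>encode_vbr(27, 4)</tt> produces the two chunks 1011
and 0011, matching the example above.</p>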
 172
 173 </div>
 174
 175 <!-- _______________________________________________________________________ -->
 176 <div class="doc_subsubsection"> <a name="char6">6-bit characters</a></div>
 177
 178 <div class="doc_text">
 179
 180 <p>6-bit characters encode common characters into a fixed 6-bit field.  They
 181 represent the following characters with the following 6-bit values:</p>
 182
 183 <div class="doc_code">
 184 <pre>
 185 'a' .. 'z' &mdash;  0 .. 25
 186 'A' .. 'Z' &mdash; 26 .. 51
 187 '0' .. '9' &mdash; 52 .. 61
 188        '.' &mdash; 62
 189        '_' &mdash; 63
 190 </pre>
 191 </div>
 192
 193 <p>This encoding is only suitable for encoding characters and strings that
 194 consist only of the above characters.  It is completely incapable of encoding
 195 characters not in the set.</p>
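<p>A small Python sketch of the mapping in the table above.  The names
<tt>CHAR6_ALPHABET</tt>, <tt>is_char6</tt>, <tt>encode_char6</tt> and
<tt>decode_char6</tt> are hypothetical.</p>

```python
# Position in this string is the 6-bit code for the character.
CHAR6_ALPHABET = (
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789._"
)


def is_char6(text: str) -> bool:
    """True if every character of `text` can be encoded as char6."""
    return all(ch in CHAR6_ALPHABET for ch in text)


def encode_char6(text: str) -> list:
    """Map each character to its 6-bit code; raises ValueError otherwise."""
    return [CHAR6_ALPHABET.index(ch) for ch in text]


def decode_char6(codes) -> str:
    """Map 6-bit codes back to characters."""
    return "".join(CHAR6_ALPHABET[c] for c in codes)
```

<p>A writer would typically test a string with something like
<tt>is_char6</tt> first and fall back to a wider encoding when the test
fails.</p>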
 196
 197 </div>
 198
 199 <!-- _______________________________________________________________________ -->
 200 <div class="doc_subsubsection"> <a name="wordalign">Word Alignment</a></div>
 201
 202 <div class="doc_text">
 203
 204 <p>Occasionally, it is useful to emit zero bits until the bitstream is a
 205 multiple of 32 bits.  This ensures that the bit position in the stream can be
 206 represented as a multiple of 32-bit words.</p>
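<p>The amount of padding needed is simple modular arithmetic.  A minimal
Python sketch, with a hypothetical function name:</p>

```python
def align32_padding(bit_pos: int) -> int:
    """Number of zero bits to emit so that bit_pos becomes a multiple of 32."""
    return (-bit_pos) % 32
```

<p>For example, a stream that is currently 37 bits long needs 27 zero bits to
reach the next 32-bit boundary at bit 64.</p>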
 207
 208 </div>
 209
 210
 211 <!-- ======================================================================= -->
 212 <div class="doc_subsection"><a name="abbrevid">Abbreviation IDs</a>
 213 </div>
 214
 215 <div class="doc_text">
 216
 217 <p>
 218 A bitstream is a sequential series of <a href="#blocks">Blocks</a> and
 219 <a href="#datarecord">Data Records</a>.  Both of these start with an
 220 abbreviation ID encoded as a fixed-bitwidth field.  The width is specified by
the current block, as described below.  The value of the abbreviation ID
specifies either a builtin ID (each of which has a special meaning, defined
below) or one of the abbreviation IDs defined by the stream itself.
 224 </p>
 225
 226 <p>
 227 The set of builtin abbrev IDs is:
 228 </p>
 229
 230 <ul>
 231 <li><tt>0 - <a href="#END_BLOCK">END_BLOCK</a></tt> &mdash; This abbrev ID marks
 232     the end of the current block.</li>
 233 <li><tt>1 - <a href="#ENTER_SUBBLOCK">ENTER_SUBBLOCK</a></tt> &mdash; This
 234     abbrev ID marks the beginning of a new block.</li>
 235 <li><tt>2 - <a href="#DEFINE_ABBREV">DEFINE_ABBREV</a></tt> &mdash; This defines
 236     a new abbreviation.</li>
 237 <li><tt>3 - <a href="#UNABBREV_RECORD">UNABBREV_RECORD</a></tt> &mdash; This ID
 238     specifies the definition of an unabbreviated record.</li>
 239 </ul>
 240
 241 <p>Abbreviation IDs 4 and above are defined by the stream itself, and specify
 242 an <a href="#abbrev_records">abbreviated record encoding</a>.</p>
 243
 244 </div>
 245
 246 <!-- ======================================================================= -->
 247 <div class="doc_subsection"><a name="blocks">Blocks</a>
 248 </div>
 249
 250 <div class="doc_text">
 251
 252 <p>
Blocks in a bitstream denote nested regions of the stream, and are identified by
a content-specific ID number (for example, LLVM IR uses an ID of 12 to represent
function bodies).  Block IDs 0-7 are reserved for <a href="#stdblocks">standard blocks</a>
whose meaning is defined by Bitcode; block IDs 8 and greater are
application specific.  Nested blocks capture the hierarchical structure of the
encoded data, and various properties are associated with blocks as the file is
parsed.  Because each block records its own length, a reader can skip a block
in constant time, for example when it only wants a summary of the blocks
present or when it encounters a block it does not understand.  The LLVM IR
reader uses this mechanism to skip function bodies, lazily reading them on
demand.
 263 </p>
 264
 265 <p>
 266 When reading and encoding the stream, several properties are maintained for the
 267 block.  In particular, each block maintains:
 268 </p>
 269
 270 <ol>
 271 <li>A current abbrev id width.  This value starts at 2, and is set every time a
 272     block record is entered.  The block entry specifies the abbrev id width for
 273     the body of the block.</li>
 274
 275 <li>A set of abbreviations.  Abbreviations may be defined within a block, in
 276     which case they are only defined in that block (neither subblocks nor
 277     enclosing blocks see the abbreviation).  Abbreviations can also be defined
 278     inside a <tt><a href="#BLOCKINFO">BLOCKINFO</a></tt> block, in which case
 279     they are defined in all blocks that match the ID that the BLOCKINFO block is
 280     describing.
 281 </li>
 282 </ol>
 283
 284 <p>
As sub-blocks are entered, these properties are saved and the new sub-block has
its own set of abbreviations, and its own abbrev id width.  When a sub-block is
exited, the saved values are restored.
 288 </p>
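<p>The save-and-restore behaviour can be sketched as a small stack in Python.
The class <tt>BlockScopeStack</tt> is hypothetical, and the sketch leaves out
the step where abbreviations registered through <tt>BLOCKINFO</tt> are loaded
on entry.</p>

```python
class BlockScopeStack:
    """Hypothetical tracker for the per-block reader state."""

    def __init__(self):
        self.abbrev_width = 2  # initial abbrev id width, per the text above
        self.abbrevs = []      # abbreviations visible in the current block
        self._saved = []       # state of the enclosing blocks

    def enter(self, new_abbrev_width: int) -> None:
        """Enter a sub-block: save the current state and start a fresh one."""
        self._saved.append((self.abbrev_width, self.abbrevs))
        self.abbrev_width = new_abbrev_width
        self.abbrevs = []  # sub-blocks do not inherit the parent's abbreviations

    def leave(self) -> None:
        """Leave the current block: restore the enclosing block's state."""
        self.abbrev_width, self.abbrevs = self._saved.pop()
```

<p>Abbreviations appended to <tt>abbrevs</tt> while inside a sub-block are
discarded on <tt>leave()</tt>, so they are never visible in the enclosing
block.</p>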
 289
 290 </div>
 291
 292 <!-- _______________________________________________________________________ -->
 293 <div class="doc_subsubsection"> <a name="ENTER_SUBBLOCK">ENTER_SUBBLOCK
 294 Encoding</a></div>
 295
 296 <div class="doc_text">
 297
 298 <p><tt>[ENTER_SUBBLOCK, blockid<sub>vbr8</sub>, newabbrevlen<sub>vbr4</sub>,
 299      &lt;align32bits&gt;, blocklen<sub>32</sub>]</tt></p>
 300
 301 <p>
 302 The <tt>ENTER_SUBBLOCK</tt> abbreviation ID specifies the start of a new block
 303 record.  The <tt>blockid</tt> value is encoded as an 8-bit VBR identifier, and
 304 indicates the type of block being entered, which can be
 305 a <a href="#stdblocks">standard block</a> or an application-specific block.
 306 The <tt>newabbrevlen</tt> value is a 4-bit VBR, which specifies the abbrev id
 307 width for the sub-block.  The <tt>blocklen</tt> value is a 32-bit aligned value
 308 that specifies the size of the subblock in 32-bit words. This value allows the
 309 reader to skip over the entire block in one jump.
 310 </p>
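<p>A Python sketch of parsing this header and then skipping the block body.
The <tt>BitReader</tt> class and both functions are hypothetical.  The sketch
assumes LSB-first bit order and that the <tt>ENTER_SUBBLOCK</tt> abbreviation
ID has already been consumed.</p>

```python
class BitReader:
    """Hypothetical LSB-first bit reader, for illustration only."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # position in bits

    def read_fixed(self, width: int) -> int:
        value = 0
        for i in range(width):
            byte = self.data[self.pos >> 3]
            value |= ((byte >> (self.pos & 7)) & 1) << i
            self.pos += 1
        return value

    def read_vbr(self, width: int) -> int:
        data_bits = width - 1
        value, shift = 0, 0
        while True:
            chunk = self.read_fixed(width)
            value |= (chunk & ((1 << data_bits) - 1)) << shift
            if chunk >> data_bits == 0:  # high bit clear: last chunk
                return value
            shift += data_bits

    def align32(self) -> None:
        self.pos += (-self.pos) % 32


def read_enter_subblock(reader: BitReader):
    """Parse the fields that follow an ENTER_SUBBLOCK abbrev ID."""
    block_id = reader.read_vbr(8)
    new_abbrev_len = reader.read_vbr(4)
    reader.align32()
    block_len_words = reader.read_fixed(32)
    return block_id, new_abbrev_len, block_len_words


def skip_block(reader: BitReader, block_len_words: int) -> None:
    """Jump over the block body without decoding it."""
    reader.pos += 32 * block_len_words
```

<p>Because <tt>blocklen</tt> is stored at a 32-bit boundary and counts whole
words, skipping is a single addition to the bit position.</p>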
 311
 312 </div>
 313
 314 <!-- _______________________________________________________________________ -->
 315 <div class="doc_subsubsection"> <a name="END_BLOCK">END_BLOCK
 316 Encoding</a></div>
 317
 318 <div class="doc_text">
 319
 320 <p><tt>[END_BLOCK, &lt;align32bits&gt;]</tt></p>
 321
 322 <p>
The <tt>END_BLOCK</tt> abbreviation ID specifies the end of the current block
record.  Its end is aligned to 32 bits to ensure that the size of the block is
a whole multiple of 32 bits.
 326 </p>
 327
 328 </div>
 329
 330
 331
 332 <!-- ======================================================================= -->
 333 <div class="doc_subsection"><a name="datarecord">Data Records</a>
 334 </div>
 335
 336 <div class="doc_text">
 337 <p>
Data records consist of a record code and a number of integer values, each up
to 64 bits wide.  The interpretation of the code and values is application
specific, and a record can be encoded in more than one way (as an unabbreviated
record or with an abbreviation).  In the LLVM IR format, for example, there is a
record which encodes the target triple of a module.  The code is
<tt>MODULE_CODE_TRIPLE</tt>, and the values of the record are the ASCII codes
for the characters in the string.
 345 </p>
 346
 347 </div>
 348
 349 <!-- _______________________________________________________________________ -->
 350 <div class="doc_subsubsection"> <a name="UNABBREV_RECORD">UNABBREV_RECORD
 351 Encoding</a></div>
 352
 353 <div class="doc_text">
 354
 355 <p><tt>[UNABBREV_RECORD, code<sub>vbr6</sub>, numops<sub>vbr6</sub>,
 356        op0<sub>vbr6</sub>, op1<sub>vbr6</sub>, ...]</tt></p>
 357
 358 <p>
 359 An <tt>UNABBREV_RECORD</tt> provides a default fallback encoding, which is both
 360 completely general and extremely inefficient.  It can describe an arbitrary
 361 record by emitting the code and operands as vbrs.
 362 </p>
 363
 364 <p>
For example, emitting an LLVM IR target triple as an unabbreviated record
requires emitting the <tt>UNABBREV_RECORD</tt> abbrevid, a vbr6 for the
<tt>MODULE_CODE_TRIPLE</tt> code, a vbr6 for the length of the string, which is
equal to the number of operands, and a vbr6 for each character.  A single vbr6
chunk holds only 5 bits of data, and no letter has an ASCII value less than 32,
so each letter needs at least a two-chunk VBR, or at least 12 bits.  This is
not an efficient encoding, but it is fully general.
 372 </p>
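<p>The cost of this encoding can be estimated with a short Python sketch.
Both function names are hypothetical.  The usage note assumes a record code
of 2 and an abbrev id width of 3, purely as example inputs.</p>

```python
def vbr_bits(value: int, width: int) -> int:
    """Bits needed to emit a non-negative value as a VBR of the given width."""
    data_bits = width - 1
    chunks = 1
    while value >= (1 << data_bits):
        value >>= data_bits
        chunks += 1
    return chunks * width


def unabbrev_record_bits(code: int, operands, abbrev_id_width: int) -> int:
    """Total size in bits of [UNABBREV_RECORD, code, numops, op0, op1, ...]."""
    bits = abbrev_id_width                  # the UNABBREV_RECORD abbrev ID
    bits += vbr_bits(code, 6)               # record code
    bits += vbr_bits(len(operands), 6)      # number of operands
    bits += sum(vbr_bits(op, 6) for op in operands)
    return bits
```

<p>For the four ASCII codes of <tt>"abcd"</tt> with those inputs, the sketch
gives 3 + 6 + 6 + 4 * 12 = 63 bits.</p>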
 373
 374 </div>
 375
 376 <!-- _______________________________________________________________________ -->
 377 <div class="doc_subsubsection"> <a name="abbrev_records">Abbreviated Record
 378 Encoding</a></div>
 379
 380 <div class="doc_text">
 381
 382 <p><tt>[&lt;abbrevid&gt;, fields...]</tt></p>
 383
 384 <p>
An abbreviated record is an abbreviation ID followed by a set of fields that are
encoded according to the <a href="#abbreviations">abbreviation definition</a>.
This allows records to be encoded significantly more densely than records
encoded with the <tt><a href="#UNABBREV_RECORD">UNABBREV_RECORD</a></tt> type.
Because the abbreviation types are specified in the stream itself, the files
remain completely self-describing.  The actual encoding of
abbreviations is defined below.
 392 </p>
 393
 394 </div>
 395
 396 <!-- ======================================================================= -->
 397 <div class="doc_subsection"><a name="abbreviations">Abbreviations</a>
 398 </div>
 399
 400 <div class="doc_text">
 401 <p>
 402 Abbreviations are an important form of compression for bitstreams.  The idea is
 403 to specify a dense encoding for a class of records once, then use that encoding
 404 to emit many records.  It takes space to emit the encoding into the file, but
 405 the space is recouped (hopefully plus some) when the records that use it are
 406 emitted.
 407 </p>
 408
 409 <p>
Abbreviations can be determined dynamically per client, per file.  Because the
abbreviations are stored in the bitstream itself, different streams of the same
format can contain different sets of abbreviations, and a stream can omit any
abbreviation it does not need.  As a concrete example, LLVM IR files usually
emit an abbreviation for binary operators.  If a specific LLVM module contains
few or no binary operators, the abbreviation does not need to be emitted.
 416 </p>
 417 </div>
 418
 419 <!-- _______________________________________________________________________ -->
 420 <div class="doc_subsubsection"><a name="DEFINE_ABBREV">DEFINE_ABBREV
 421  Encoding</a></div>
 422
 423 <div class="doc_text">
 424
 425 <p><tt>[DEFINE_ABBREV, numabbrevops<sub>vbr5</sub>, abbrevop0, abbrevop1,
 426  ...]</tt></p>
 427
 428 <p>
 429 A <tt>DEFINE_ABBREV</tt> record adds an abbreviation to the list of currently
 430 defined abbreviations in the scope of this block.  This definition only exists
 431 inside this immediate block &mdash; it is not visible in subblocks or enclosing
 432 blocks.  Abbreviations are implicitly assigned IDs sequentially starting from 4
 433 (the first application-defined abbreviation ID).  Any abbreviations defined in a
 434 <tt>BLOCKINFO</tt> record receive IDs first, in order, followed by any
 435 abbreviations defined within the block itself.  Abbreviated data records
 436 reference this ID to indicate what abbreviation they are invoking.
 437 </p>
 438
 439 <p>
 440 An abbreviation definition consists of the <tt>DEFINE_ABBREV</tt> abbrevid
 441 followed by a VBR that specifies the number of abbrev operands, then the abbrev
 442 operands themselves.  Abbreviation operands come in three forms.  They all start
 443 with a single bit that indicates whether the abbrev operand is a literal operand
 444 (when the bit is 1) or an encoding operand (when the bit is 0).
 445 </p>
 446
 447 <ol>
 448 <li>Literal operands &mdash; <tt>[1<sub>1</sub>, litvalue<sub>vbr8</sub>]</tt>
 449 &mdash; Literal operands specify that the value in the result is always a single
 450 specific value.  This specific value is emitted as a vbr8 after the bit
 451 indicating that it is a literal operand.</li>
 452 <li>Encoding info without data &mdash; <tt>[0<sub>1</sub>,
 453  encoding<sub>3</sub>]</tt> &mdash; Operand encodings that do not have extra
 454  data are just emitted as their code.
 455 </li>
 456 <li>Encoding info with data &mdash; <tt>[0<sub>1</sub>, encoding<sub>3</sub>,
 457 value<sub>vbr5</sub>]</tt> &mdash; Operand encodings that do have extra data are
 458 emitted as their code, followed by the extra data.
 459 </li>
 460 </ol>
 461
 462 <p>The possible operand encodings are:</p>
 463
 464 <ol>
 465 <li>Fixed: The field should be emitted as
 466     a <a href="#fixedwidth">fixed-width value</a>, whose width is specified by
 467     the operand's extra data.</li>
 468 <li>VBR: The field should be emitted as
 469     a <a href="#variablewidth">variable-width value</a>, whose width is
 470     specified by the operand's extra data.</li>
 471 <li>Array: This field is an array of values.  The array operand
 472     has no extra data, but expects another operand to follow it which indicates
 473     the element type of the array.  When reading an array in an abbreviated
 474     record, the first integer is a vbr6 that indicates the array length,
 475     followed by the encoded elements of the array.  An array may only occur as
 476     the last operand of an abbreviation (except for the one final operand that
 477     gives the array's type).</li>
 478 <li>Char6: This field should be emitted as
 479     a <a href="#char6">char6-encoded value</a>.  This operand type takes no
 480     extra data.</li>
<li>Blob: This field is emitted as a vbr6 giving the length, followed by padding to a
    32-bit boundary (for alignment) and an array of 8-bit objects.  The array of
    bytes is further followed by tail padding to ensure that its total length is
    a multiple of 4 bytes.  This makes it very efficient for the reader to
    decode the data without having to make a copy of it: it can use a pointer to
    the data in the mapped-in file and read it in place.  A blob may only
    occur as the last operand of an abbreviation.</li>
 488 </ol>
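<p>The field layout of a <tt>DEFINE_ABBREV</tt> body can be sketched in
Python as follows.  The function name and the tuple format of its input are
hypothetical.  The numeric encoding codes 1 through 5 are an assumption that
follows the order of the list above.</p>

```python
# Assumed numeric codes, following the order of the list above.
FIXED, VBR, ARRAY, CHAR6, BLOB = 1, 2, 3, 4, 5
HAS_EXTRA_DATA = {FIXED, VBR}


def define_abbrev_fields(operands):
    """List the (name, value, encoding) fields that follow DEFINE_ABBREV.

    Each operand is ("lit", value), ("enc", code) or ("enc", code, data).
    """
    fields = [("numabbrevops", len(operands), "vbr5")]
    for op in operands:
        if op[0] == "lit":
            fields.append(("is_literal", 1, "fixed1"))
            fields.append(("litvalue", op[1], "vbr8"))
        else:
            fields.append(("is_literal", 0, "fixed1"))
            fields.append(("encoding", op[1], "fixed3"))
            if op[1] in HAS_EXTRA_DATA:
                fields.append(("value", op[2], "vbr5"))
    return fields
```

<p>Array, Char6 and Blob operands contribute only the flag bit and the 3-bit
encoding, while Fixed and VBR operands add a vbr5 for their width.</p>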
 489
 490 <p>
 491 For example, target triples in LLVM modules are encoded as a record of the
 492 form <tt>[TRIPLE, 'a', 'b', 'c', 'd']</tt>.  Consider if the bitstream emitted
 493 the following abbrev entry:
 494 </p>
 495
 496 <div class="doc_code">
 497 <pre>
 498 [0, Fixed, 4]
 499 [0, Array]
 500 [0, Char6]
 501 </pre>
 502 </div>
 503
 504 <p>
When emitting a record with this abbreviation, the record above would be emitted
as:
 507 </p>
 508
 509 <div class="doc_code">
 510 <p>
 511 <tt>[4<sub>abbrevwidth</sub>, 2<sub>4</sub>, 4<sub>vbr6</sub>, 0<sub>6</sub>,
 512 1<sub>6</sub>, 2<sub>6</sub>, 3<sub>6</sub>]</tt>
 513 </p>
 514 </div>
 515
 516 <p>These values are:</p>
 517
 518 <ol>
 519 <li>The first value, 4, is the abbreviation ID for this abbreviation.</li>
 520 <li>The second value, 2, is the code for <tt>TRIPLE</tt> in LLVM IR files.</li>
 521 <li>The third value, 4, is the length of the array.</li>
 522 <li>The rest of the values are the char6 encoded values
 523     for <tt>"abcd"</tt>.</li>
 524 </ol>
 525
 526 <p>
With this abbreviation, the triple is emitted with only 37 bits (assuming an
abbrev id width of 3): 3 bits for the abbreviation ID, 4 for the code, 6 for
the array length, and 24 for the four characters.  Without the abbreviation,
significantly more space would be required to emit the target triple.  Also,
because the <tt>TRIPLE</tt> value is not emitted as a literal in the
abbreviation, the abbreviation can also be used for any other string value.
 532 </p>
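<p>The 37-bit figure can be reproduced with a short Python sketch.  The
function name is hypothetical, and the sketch hard-codes the example
abbreviation (a 4-bit fixed code followed by a char6 array) rather than
interpreting an arbitrary definition.</p>

```python
CHAR6_ALPHABET = (
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789._"
)


def emit_with_example_abbrev(abbrev_id, abbrev_width, code, text):
    """Return (value, bit_width) pairs for [Fixed(4), Array, Char6]."""
    if len(text) >= 32:
        # Longer arrays need a multi-chunk vbr6 length; not handled here.
        raise ValueError("sketch only handles single-chunk vbr6 lengths")
    fields = [(abbrev_id, abbrev_width)]   # which abbreviation is used
    fields.append((code, 4))               # record code as a 4-bit fixed field
    fields.append((len(text), 6))          # array length as a vbr6
    fields.extend((CHAR6_ALPHABET.index(ch), 6) for ch in text)
    return fields
```

<p>Calling it with abbreviation ID 4, width 3, code 2 and <tt>"abcd"</tt>
gives the values 4, 2, 4, 0, 1, 2, 3 and a total width of 37 bits.</p>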
 533
 534 </div>
 535
 536 <!-- ======================================================================= -->
 537 <div class="doc_subsection"><a name="stdblocks">Standard Blocks</a>
 538 </div>
 539
 540 <div class="doc_text">
 541
 542 <p>
 543 In addition to the basic block structure and record encodings, the bitstream
 544 also defines specific builtin block types.  These block types specify how the
 545 stream is to be decoded or other metadata.  In the future, new standard blocks
 546 may be added.  Block IDs 0-7 are reserved for standard blocks.
 547 </p>
 548
 549 </div>
 550
 551 <!-- _______________________________________________________________________ -->
 552 <div class="doc_subsubsection"><a name="BLOCKINFO">#0 - BLOCKINFO
 553 Block</a></div>
 554
 555 <div class="doc_text">
 556
 557 <p>
 558 The <tt>BLOCKINFO</tt> block allows the description of metadata for other
 559 blocks.  The currently specified records are:
 560 </p>
 561
 562 <div class="doc_code">
 563 <pre>
 564 [SETBID (#1), blockid]
 565 [DEFINE_ABBREV, ...]
 566 [BLOCKNAME, ...name...]
 567 [SETRECORDNAME, RecordID, ...name...]
 568 </pre>
 569 </div>
 570
 571 <p>
 572 The <tt>SETBID</tt> record indicates which block ID is being
 573 described.  <tt>SETBID</tt> records can occur multiple times throughout the
 574 block to change which block ID is being described.  There must be
 575 a <tt>SETBID</tt> record prior to any other records.
 576 </p>
 577
 578 <p>
 579 Standard <tt>DEFINE_ABBREV</tt> records can occur inside <tt>BLOCKINFO</tt>
 580 blocks, but unlike their occurrence in normal blocks, the abbreviation is
 581 defined for blocks matching the block ID we are describing, <i>not</i> the
 582 <tt>BLOCKINFO</tt> block itself.  The abbreviations defined
 583 in <tt>BLOCKINFO</tt> blocks receive abbreviation IDs as described
 584 in <tt><a href="#DEFINE_ABBREV">DEFINE_ABBREV</a></tt>.
 585 </p>
 586
<p>The <tt>BLOCKNAME</tt> record can optionally occur in this block.  The elements of
the record are the bytes for the string name of the block.  llvm-bcanalyzer uses
this to dump out bitcode files symbolically.</p>
 590
 591 <p>The <tt>SETRECORDNAME</tt> record can optionally occur in this block.  The
 592 first entry is a record ID number and the rest of the elements of the record are
 593 the bytes for the string name of the record.  llvm-bcanalyzer uses
 594 this to dump out bitcode files symbolically.</p>
 595
 596 <p>
 597 Note that although the data in <tt>BLOCKINFO</tt> blocks is described as
 598 "metadata," the abbreviations they contain are essential for parsing records
 599 from the corresponding blocks.  It is not safe to skip them.
 600 </p>
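<p>A Python sketch of how a reader might track abbreviations registered
through <tt>BLOCKINFO</tt> and then assign abbreviation IDs when a block is
entered.  The function names and the tuple representation of records are
hypothetical.  Records other than <tt>SETBID</tt> and
<tt>DEFINE_ABBREV</tt> are ignored.</p>

```python
def process_blockinfo(records):
    """Collect abbreviations per block ID from a BLOCKINFO block.

    `records` is a sequence of ("SETBID", block_id) or
    ("DEFINE_ABBREV", abbrev) tuples.
    """
    abbrevs_by_block = {}
    current = None
    for kind, payload in records:
        if kind == "SETBID":
            current = payload  # later records describe this block ID
        elif kind == "DEFINE_ABBREV":
            if current is None:
                raise ValueError("SETBID must precede other records")
            abbrevs_by_block.setdefault(current, []).append(payload)
    return abbrevs_by_block


def abbrev_id_table(blockinfo_abbrevs, local_abbrevs):
    """Assign IDs from 4 upward: BLOCKINFO abbreviations first, then local ones."""
    ordered = list(blockinfo_abbrevs) + list(local_abbrevs)
    return {4 + i: abbrev for i, abbrev in enumerate(ordered)}
```

<p>A reader that skipped the <tt>BLOCKINFO</tt> block would build a shorter
table, so every abbreviation ID in the affected blocks would resolve to the
wrong definition.</p>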
 601
 602 </div>
 603
 604 <!-- *********************************************************************** -->
 605 <div class="doc_section"> <a name="wrapper">Bitcode Wrapper Format</a></div>
 606 <!-- *********************************************************************** -->
 607
 608 <div class="doc_text">
 609
 610 <p>
 611 Bitcode files for LLVM IR may optionally be wrapped in a simple wrapper
 612 structure.  This structure contains a simple header that indicates the offset
 613 and size of the embedded BC file.  This allows additional information to be
 614 stored alongside the BC file.  The structure of this file header is:
 615 </p>
 616
 617 <div class="doc_code">
 618 <p>
 619 <tt>[Magic<sub>32</sub>, Version<sub>32</sub>, Offset<sub>32</sub>,
 620 Size<sub>32</sub>, CPUType<sub>32</sub>]</tt>
 621 </p>
 622 </div>
 623
 624 <p>
Each of the fields is a 32-bit field stored in little endian form (as with
the rest of the bitcode file fields).  The Magic number is always
<tt>0x0B17C0DE</tt> and the version is currently always <tt>0</tt>.  The Offset
field is the offset in bytes to the start of the bitcode stream in the file, and
the Size field is the size in bytes of the stream. CPUType is a target-specific
value that can be used to encode the CPU of the target.
 631 </p>
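<p>A Python sketch of reading this header with the standard
<tt>struct</tt> module.  The function names and the dictionary layout are
hypothetical.</p>

```python
import struct

WRAPPER_MAGIC = 0x0B17C0DE


def parse_wrapper_header(data: bytes) -> dict:
    """Decode the five little-endian 32-bit fields of the wrapper header."""
    magic, version, offset, size, cpu_type = struct.unpack_from("<5I", data, 0)
    if magic != WRAPPER_MAGIC:
        raise ValueError("not a bitcode wrapper")
    return {"version": version, "offset": offset,
            "size": size, "cpu_type": cpu_type}


def extract_bitcode(data: bytes) -> bytes:
    """Return the embedded bitcode stream described by the header."""
    header = parse_wrapper_header(data)
    start = header["offset"]
    return data[start:start + header["size"]]
```

<p>Any bytes between the end of the header and <tt>Offset</tt>, or after the
end of the stream, are free for the additional information mentioned
above.</p>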
 632
 633 </div>
 634
 635 <!-- *********************************************************************** -->
 636 <div class="doc_section"> <a name="llvmir">LLVM IR Encoding</a></div>
 637 <!-- *********************************************************************** -->
 638
 639 <div class="doc_text">
 640
 641 <p>
 642 LLVM IR is encoded into a bitstream by defining blocks and records.  It uses
 643 blocks for things like constant pools, functions, symbol tables, etc.  It uses
 644 records for things like instructions, global variable descriptors, type
descriptions, etc.  This document does not describe the set of abbreviations
that the writer uses, as these are fully self-described in the file, and the
reader is not allowed to build in any knowledge of them.
 648 </p>
 649
 650 </div>
 651
 652 <!-- ======================================================================= -->
 653 <div class="doc_subsection"><a name="basics">Basics</a>
 654 </div>
 655
 656 <!-- _______________________________________________________________________ -->
 657 <div class="doc_subsubsection"><a name="ir_magic">LLVM IR Magic Number</a></div>
 658
 659 <div class="doc_text">
 660
 661 <p>
 662 The magic number for LLVM IR files is:
 663 </p>
 664
 665 <div class="doc_code">
 666 <p>
 667 <tt>[0x0<sub>4</sub>, 0xC<sub>4</sub>, 0xE<sub>4</sub>, 0xD<sub>4</sub>]</tt>
 668 </p>
 669 </div>
 670
 671 <p>
 672 When combined with the bitcode magic number and viewed as bytes, this is
 673 <tt>"BC&nbsp;0xC0DE"</tt>.
 674 </p>
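<p>The nibble order may look reversed at first.  Because bits are read
starting at the least significant bit of each byte, the first 4-bit field of
a pair lands in the low half of its byte.  A Python sketch, with a
hypothetical function name:</p>

```python
def pack_nibbles_lsb_first(nibbles) -> bytes:
    """Pack 4-bit fields into bytes; the first field of each pair is the low nibble."""
    out = bytearray()
    for i in range(0, len(nibbles), 2):
        out.append(nibbles[i] | (nibbles[i + 1] << 4))
    return bytes(out)
```

<p>Packing the four fields 0x0, 0xC, 0xE, 0xD this way gives the bytes
<tt>0xC0 0xDE</tt>, which follow <tt>'B'</tt> and <tt>'C'</tt> in the
file.</p>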
 675
 676 </div>
 677
 678 <!-- _______________________________________________________________________ -->
 679 <div class="doc_subsubsection"><a name="ir_signed_vbr">Signed VBRs</a></div>
 680
 681 <div class="doc_text">
 682
 683 <p>
 684 <a href="#variablewidth">Variable Width Integers</a> are an efficient way to
encode arbitrary sized unsigned values, but are an extremely inefficient way to
encode signed values, because a negative value would otherwise be treated as a
maximally large unsigned value.
 688 </p>
 689
 690 <p>
 691 As such, signed vbr values of a specific width are emitted as follows:
 692 </p>
 693
 694 <ul>
 695 <li>Positive values are emitted as vbrs of the specified width, but with their
 696     value shifted left by one.</li>
 697 <li>Negative values are emitted as vbrs of the specified width, but the negated
 698     value is shifted left by one, and the low bit is set.</li>
 699 </ul>
 700
 701 <p>
 702 With this encoding, small positive and small negative values can both be emitted
 703 efficiently.
 704 </p>
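<p>The two rules above map a signed value to an unsigned one before it is
emitted as a VBR.  A Python sketch of that mapping and its inverse, with
hypothetical function names:</p>

```python
def encode_signed(value: int) -> int:
    """Map a signed integer to the unsigned value that is emitted as a VBR."""
    if value >= 0:
        return value << 1          # low bit clear marks a non-negative value
    return ((-value) << 1) | 1     # low bit set marks a negative value


def decode_signed(encoded: int) -> int:
    """Inverse of encode_signed."""
    magnitude = encoded >> 1
    return -magnitude if encoded & 1 else magnitude
```

<p>For instance, 3 maps to 6 and -3 maps to 7, so both fit in a single vbr4
chunk.</p>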
 705
 706 </div>
 707
 708
 709 <!-- _______________________________________________________________________ -->
 710 <div class="doc_subsubsection"><a name="ir_blocks">LLVM IR Blocks</a></div>
 711
 712 <div class="doc_text">
 713
 714 <p>
 715 LLVM IR is defined with the following blocks:
 716 </p>
 717
 718 <ul>
 719 <li>8  &mdash; <tt>MODULE_BLOCK</tt> &mdash; This is the top-level block that
 720     contains the entire module, and describes a variety of per-module
 721     information.</li>
 722 <li>9  &mdash; <tt>PARAMATTR_BLOCK</tt> &mdash; This enumerates the parameter
 723     attributes.</li>
 724 <li>10 &mdash; <tt>TYPE_BLOCK</tt> &mdash; This describes all of the types in
 725     the module.</li>
 726 <li>11 &mdash; <tt>CONSTANTS_BLOCK</tt> &mdash; This describes constants for a
 727     module or function.</li>
 728 <li>12 &mdash; <tt>FUNCTION_BLOCK</tt> &mdash; This describes a function
 729     body.</li>
 730 <li>13 &mdash; <tt>TYPE_SYMTAB_BLOCK</tt> &mdash; This describes the type symbol
 731     table.</li>
 732 <li>14 &mdash; <tt>VALUE_SYMTAB_BLOCK</tt> &mdash; This describes a value symbol
 733     table.</li>
 734 </ul>
 735
 736 </div>
 737
 738 <!-- ======================================================================= -->
 739 <div class="doc_subsection"><a name="MODULE_BLOCK">MODULE_BLOCK Contents</a>
 740 </div>
 741
 742 <div class="doc_text">
 743
 744 <p>
 745 </p>
 746
 747 </div>
 748
 749
 750 <!-- *********************************************************************** -->
 751 <hr>
 752 <address> <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
 753  src="http://jigsaw.w3.org/css-validator/images/vcss-blue" alt="Valid CSS"></a>
 754 <a href="http://validator.w3.org/check/referer"><img
 755  src="http://www.w3.org/Icons/valid-html401-blue" alt="Valid HTML 4.01"></a>
 756  <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
 757 <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
 758 Last modified: $Date$
 759 </address>
 760 </body>
 761 </html>