llvm/docs/SymbolizerMarkupFormat.rst

   1 ==========================
   2 Symbolizer Markup Format
   3 ==========================
   4
   5 .. contents::
   6    :local:
   7
   8 Overview
   9 ========
  10
  11 This document defines a text format for log messages that can be processed by a
  12 symbolizing filter. The basic idea is that logging code emits text that contains
  13 raw address values and so forth, without the logging code doing any real work to
  14 convert those values to human-readable form. Instead, logging text uses the
  15 markup format defined here to identify pieces of information that should be
  16 converted to human-readable form after the fact. As with other markup formats,
  17 the expectation is that most of the text will be displayed as is, while the
  18 markup elements will be replaced with expanded text, or converted into active UI
  19 elements, that present more details in symbolic form.
  20
  21 This means there is no need for symbol tables, DWARF debugging sections, or
  22 similar information to be directly accessible at runtime. There is also no need
  23 at runtime for any logic intended to compute human-readable presentation of
  24 information, such as C++ symbol demangling. Instead, logging must include markup
  25 elements that give the contextual information necessary to make sense of the raw
  26 data, such as memory layout details.
  27
  28 This format identifies markup elements with a syntax that is both simple and
  29 distinctive. It's simple enough to be matched and parsed with straightforward
  30 code. It's distinctive enough that character sequences that look like the start
  31 or end of a markup element should rarely if ever appear incidentally in logging
  32 text. It's specifically intended not to require sanitizing plain text, such as
  33 the HTML/XML requirement to replace ``<`` with ``&lt;`` and the like.
  34
  35 :doc:`llvm-symbolizer <CommandGuide/llvm-symbolizer>` includes a symbolizing
  36 filter via its ``--filter-markup`` option. Also, LLVM utilities emit stack
  37 traces as markup when the ``LLVM_ENABLE_SYMBOLIZER_MARKUP`` environment
  38 variable is set.
  39
  40 Scope and assumptions
  41 =====================
  42
  43 A symbolizing filter implementation will be independent both of the target
  44 operating system and machine architecture where the logs are generated and of
  45 the host operating system and machine architecture where the filter runs.
  46
  47 This format assumes that the symbolizing filter processes intact whole lines. If
  48 long lines might be split during some stage of a logging pipeline, they must be
  49 reassembled to restore the original line breaks before feeding lines into the
  50 symbolizing filter. Most markup elements must appear entirely on a single line
  51 (often with other text before and/or after the markup element). There are some
  52 markup elements that are specified to span lines, with line breaks in the middle
  53 of the element. Even in those cases, the filter is not expected to handle line
  54 breaks in arbitrary places inside a markup element, but only inside certain
  55 fields.
  56
  57 This format assumes that the symbolizing filter processes a coherent stream of
  58 log lines from a single process address space context. If a logging stream
  59 interleaves log lines from more than one process, these must be collated into
  60 separate per-process log streams and each stream processed by a separate
  61 instance of the symbolizing filter. Because the kernel and user processes use
  62 disjoint address regions in most operating systems, a single user process
  63 address space plus the kernel address space can be treated as a single address
  64 space for symbolization purposes if desired.
  65
  66 Dependence on Build IDs
  67 =======================
  68
  69 The symbolizer markup scheme relies on contextual information about runtime
  70 memory address layout to make it possible to convert markup elements into useful
  71 symbolic form. This relies on having an unmistakable identification of which
  72 binary was loaded at each address.
  73
  74 An ELF Build ID is the payload of an ELF note with name ``"GNU"`` and type
  75 ``NT_GNU_BUILD_ID``, a unique byte sequence that identifies a particular binary
  76 (executable, shared library, loadable module, or driver module). The linker
  77 generates this automatically based on a hash that includes the complete symbol
  78 table and debugging information, even if this is later stripped from the binary.
  79
  80 This specification uses the ELF Build ID as the sole means of identifying
  81 binaries. Each binary relevant to the log must have been linked with a unique
  82 Build ID. The symbolizing filter must have some means of mapping a Build ID back
  83 to the original ELF binary (either the whole unstripped binary, or a stripped
  84 binary paired with a separate debug file).
  85
  86 Colorization
  87 ============
  88
  89 The markup format supports a restricted subset of ANSI X3.64 SGR (Select Graphic
  90 Rendition) control sequences. These are unlike other markup elements:
  91
  92 * They specify presentation details (bold or colors) rather than semantic
  93   information. The association of semantic meaning with color (e.g. red for
  94   errors) is chosen by the code doing the logging, rather than by the UI
  95   presentation of the symbolizing filter. This is a concession to existing code
  96   (e.g. LLVM sanitizer runtimes) that uses specific colors and would require
  97   substantial changes to generate semantic markup instead.
  98
  99 * A single control sequence changes "the state", rather than being a
 100   hierarchical structure that surrounds affected text.
 101
 102 The filter processes ANSI SGR control sequences only within a single line. If a
 103 control sequence to enter a bold or color state is encountered, it's expected
 104 that the control sequence to reset to default state will be encountered before
 105 the end of that line. If a "dangling" state is left at the end of a line, the
 106 filter may reset to default state for the next line.
 107
 108 An SGR control sequence is not interpreted inside any other markup element.
 109 However, other markup elements may appear between SGR control sequences and the
 110 color/bold state is expected to apply to the symbolic output that replaces the
 111 markup element in the filter's output.
 112
 113 The accepted SGR control sequences all have the form ``"\033[%um"`` (expressed here
 114 using C string syntax), where ``%u`` is one of these:
 115
 116 ==== ============================ ===============================================
 117 Code Effect                       Notes
 118 ==== ============================ ===============================================
 119 0    Reset to default formatting.
 120 1    Bold text                    Combines with color states, doesn't reset them.
 121 30   Black foreground
 122 31   Red foreground
 123 32   Green foreground
 124 33   Yellow foreground
 125 34   Blue foreground
 126 35   Magenta foreground
 127 36   Cyan foreground
 128 37   White foreground
 129 ==== ============================ ===============================================
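
As a rough illustration of the table above, the accepted sequences can be
matched with a single regular expression. This is a minimal Python sketch, not
the code ``llvm-symbolizer`` uses; the names ``SGR_RE`` and ``split_sgr`` are
hypothetical, and the sketch ignores the rule that SGR sequences are not
interpreted inside other markup elements.

.. code-block:: python

   import re

   # Codes accepted by the format: 0 (reset), 1 (bold), 30-37 (foreground).
   SGR_RE = re.compile(r"\x1b\[(0|1|3[0-7])m")


   def split_sgr(line):
       """Return (text without accepted SGR sequences, list of codes seen)."""
       codes = [int(m.group(1)) for m in SGR_RE.finditer(line)]
       return SGR_RE.sub("", line), codes

Sequences with any other code, such as ``"\033[38m"``, are left in the text
untouched by this sketch.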
 130
 131 Common markup element syntax
 132 ============================
 133
 134 All the markup elements share a common syntactic structure to facilitate simple
 135 matching and parsing code. Each element has the form::
 136
 137   {{{tag:fields}}}
 138
 139 ``tag`` identifies one of the element types described below, and is always a
 140 short alphabetic string that must be in lower case. The rest of the element
 141 consists of one or more fields. Fields are separated by ``:`` and cannot contain
 142 any ``:`` or ``}`` characters. How many fields must be or may be present and
 143 what they contain is specified for each element type.
 144
 145 No markup elements or ANSI SGR control sequences are interpreted inside the
 146 contents of a field.
 147
 148 Implementations must ignore markup fields after those expected; this allows
 149 adding new fields to backwards-compatibly extend elements. Implementations need
 150 not ignore them silently, but the element should behave otherwise as if the
 151 fields were removed.
 152
 153 In the descriptions of each element type, ``printf``-style placeholders indicate
 154 field contents:
 155
 156 ``%s``
 157   A string of printable characters, not including ``:`` or ``}``.
 158
 159 ``%p``
 160   An address value represented by ``0x`` followed by an even number of
 161   hexadecimal digits (using either lower-case or upper-case for ``A``–``F``).
 162   If the digits are all ``0`` then the ``0x`` prefix may be omitted. No more
 163   than 16 hexadecimal digits are expected to appear in a single value (64 bits).
 164
 165 ``%u``
 166   A nonnegative decimal integer.
 167
 168 ``%i``
 169   A nonnegative integer. The digits are hexadecimal if prefixed by ``0x``, octal
 170   if prefixed by ``0``, or decimal otherwise.
 171
 172 ``%x``
 173   A sequence of an even number of hexadecimal digits (using either lower-case or
 174   upper-case for ``A``–``F``), with no ``0x`` prefix. This represents an
 175   arbitrary sequence of bytes, such as an ELF Build ID.
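
The common syntax and the field placeholders above could be handled along the
following lines. This is a minimal Python sketch under stated assumptions: the
names ``ELEMENT_RE``, ``find_elements``, ``parse_p``, ``parse_i``, and
``parse_x`` are hypothetical, and ``parse_p`` does not enforce the even-digit
or 16-digit expectations.

.. code-block:: python

   import re

   # "{{{" + lower-case tag + zero or more ":"-prefixed fields + "}}}".
   # Fields cannot contain ":" or "}", but may contain line breaks.
   ELEMENT_RE = re.compile(r"\{\{\{([a-z]+)((?::[^:}]*)*)\}\}\}")


   def find_elements(text):
       """Yield (tag, fields) for each markup element found in text."""
       for match in ELEMENT_RE.finditer(text):
           # group(2) is "" or ":f1:f2..."; drop the piece before the first ":".
           yield match.group(1), match.group(2).split(":")[1:]


   def parse_p(field):
       """%p: 0x plus hex digits; an all-zero value may omit the prefix."""
       if re.fullmatch(r"0+|0x[0-9a-fA-F]+", field):
           return int(field, 16)
       raise ValueError(f"bad %p field: {field!r}")


   def parse_i(field):
       """%i: hexadecimal after 0x, octal after a leading 0, else decimal."""
       if re.fullmatch(r"0x[0-9a-fA-F]+", field):
           return int(field, 16)
       if re.fullmatch(r"0[0-7]+", field):
           return int(field, 8)
       if re.fullmatch(r"0|[1-9][0-9]*", field):
           return int(field, 10)
       raise ValueError(f"bad %i field: {field!r}")


   def parse_x(field):
       """%x: an even number of hex digits, decoded to raw bytes."""
       if re.fullmatch(r"(?:[0-9a-fA-F]{2})+", field):
           return bytes.fromhex(field)
       raise ValueError(f"bad %x field: {field!r}")

A caller would typically dispatch on the tag and read only the fields it
expects, which leaves any extra trailing fields ignored as required above.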
 176
 177 Presentation elements
 178 =====================
 179
 180 These are elements that convey a specific program entity to be displayed in
 181 human-readable symbolic form.
 182
 183 ``{{{symbol:%s}}}``
 184   Here ``%s`` is the linkage name for a symbol or type. It may require
 185   demangling according to language ABI rules. Even for unmangled names, it's
 186   recommended that this markup element be used to identify a symbol name so that
 187   it can be presented distinctively.
 188
 189   Examples::
 190
 191     {{{symbol:_ZN7Mangled4NameEv}}}
 192     {{{symbol:foobar}}}
 193
 194 ``{{{pc:%p}}}``, ``{{{pc:%p:ra}}}``, ``{{{pc:%p:pc}}}``
 195
 196   Here ``%p`` is the memory address of a code location. It might be presented as a
 197   function name and source location. The latter two forms distinguish the kind of
 198   code location, as described in detail for ``bt`` elements below.
 199
 200   Examples::
 201
 202     {{{pc:0x12345678}}}
 203     {{{pc:0xffffffff9abcdef0}}}
 204
 205 ``{{{data:%p}}}``
 206
 207   Here ``%p`` is the memory address of a data location. It might be presented as
 208   the name of a global variable at that location.
 209
 210   Examples::
 211
 212     {{{data:0x12345678}}}
 213     {{{data:0xffffffff9abcdef0}}}
 214
 215 ``{{{bt:%u:%p}}}``, ``{{{bt:%u:%p:ra}}}``, ``{{{bt:%u:%p:pc}}}``
 216
 217   This represents one frame in a backtrace. It usually appears on a line by
 218   itself (surrounded only by whitespace), in a sequence of such lines with
 219   ascending frame numbers. The human-readable output can therefore be
 220   formatted on that assumption, so that it looks good for a sequence of ``bt``
 221   elements, each alone on its line with uniform indentation. However, the
 222   element can appear anywhere, so the filter should not remove any
 223   non-whitespace text surrounding it.
 224
 225   Here ``%u`` is the frame number, which starts at zero for the location of the
 226   fault being identified, is one for the caller of frame zero, two for the
 227   caller of frame one, and so on. ``%p`` is the memory address of a code
 228   location.
 229
 230   Code locations in a backtrace come from two distinct sources. Most backtrace
 231   frames describe a return address code location, i.e. the instruction
 232   immediately after a call instruction. This is the location of code that has
 233   yet to run, since the function called there has not yet returned. Hence the
 234   code location of actual interest is usually the call site itself rather than
 235   the return address, i.e. one instruction earlier. When presenting the source
 236   location for a return address frame, the symbolizing filter will subtract one
 237   byte or one instruction length from the actual return address for the call
 238   site, with the intent that the address logged can be translated directly to a
 239   source location for the call site and not for the apparent return site
 240   thereafter (which can be confusing).  When inlined functions are involved, the
 241   call site and the return site can appear to be in different functions at
 242   entirely unrelated source locations rather than just a line away, making the
 243   confusion of showing the return site rather than the call site quite severe.
 244
 245   Often the first frame in a backtrace ("frame zero") identifies the precise
 246   code location of a fault, trap, or asynchronous interrupt rather than a return
 247   address. At other times, even the first frame is actually a return address
 248   (for example, backtraces collected at the time of an object allocation and
 249   reported later when the allocated object is used or misused). When a system
 250   supports in-thread trap handling, there may also be frames after the first
 251   that represent a precise interrupted code location rather than a return
 252   address, presented as the "caller" of a trap handler function (for example,
 253   signal handlers in POSIX systems).
 254
 255   Return address frames are identified by the ``:ra`` suffix. Precise code
 256   location frames are identified by the ``:pc`` suffix.
 257
 258   Traditional practice has often been to collect backtraces as simple address
 259   lists, losing the distinction between return address code locations and
 260   precise code locations. Some such code applies the "subtract one" adjustment
 261   described above to the address values before reporting them, and it's not
 262   always clear or consistent whether this adjustment has been applied or not.
 263   These ambiguous cases are supported by the ``bt`` and ``pc`` forms with no
 264   ``:ra`` or ``:pc`` suffix, which indicate it's unclear which sort of code
 265   location this is.  However, it's highly recommended that all emitters use the
 266   suffixed forms and deliver address values with no adjustments applied. When
 267   traditional practice has been ambiguous, the majority of cases seem to have
 268   been of printing addresses that are return address code locations and printing
 269   them without adjustment. So the symbolizing filter will usually apply the
 270   "subtract one byte" adjustment to an address printed without a disambiguating
 271   suffix. Assuming that a call instruction is longer than one byte on all
 272   supported machines, applying the "subtract one byte" adjustment a second time
 273   still results in an address somewhere in the call instruction, so a little
 274   sloppiness here often does little or no harm.
 275
 276   Examples::
 277
 278     {{{bt:0:0x12345678:pc}}}
 279     {{{bt:1:0xffffffff9abcdef0:ra}}}
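
The adjustment rules described above for ``bt`` and ``pc`` elements can be
sketched as follows. This is a minimal Python illustration; the function name
``lookup_address`` is hypothetical, and it always subtracts one byte, which is
one of the two options (one byte or one instruction length) mentioned above.

.. code-block:: python

   def lookup_address(address, kind=""):
       """Return the address to use when looking up a source location.

       kind is the optional suffix field: "ra", "pc", or "" when absent.
       """
       if kind == "pc":
           # Precise code location: use the address exactly as logged.
           return address
       # Return address, or the ambiguous unsuffixed form that is usually
       # treated like one: step back into the call instruction.
       return address - 1

For example, ``lookup_address(0xffffffff9abcdef0, "ra")`` gives
``0xffffffff9abcdeef``, which still falls inside the call instruction.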
 280
 281 ``{{{hexdict:...}}}`` [#not_yet_implemented]_
 282
 283   This element can span multiple lines. Here ``...`` is a sequence of key-value
 284   pairs where a single ``:`` separates each key from its value, and arbitrary
 285   whitespace separates the pairs. The value (right-hand side) of each pair
 286   either is one or more ``0`` digits, or is ``0x`` followed by hexadecimal
 287   digits. Each value might be a memory address or might be some other integer
 288   (including an integer that looks like a likely memory address but actually has
 289   an unrelated purpose). When the contextual information about the memory layout
 290   suggests that a given value could be a code location or a global variable data
 291   address, it might be presented as a source location or variable name or with
 292   active UI that makes such interpretation optionally visible.
 293
 294   The intended use is for things like register dumps, where the emitter doesn't
 295   know which values might have a symbolic interpretation but a presentation that
 296   makes plausible symbolic interpretations available might be very useful to
 297   someone reading the log. At the same time, a flat text presentation should
 298   usually avoid interfering too much with the original contents and formatting
 299   of the dump. For example, it might use footnotes with source locations for
 300   values that appear to be code locations. An active UI presentation might show
 301   the dump text as is, but highlight values with symbolic information available
 302   and pop up a presentation of symbolic details when a value is selected.
 303
 304   Example::
 305
 306     {{{hexdict:
 307         CS:                   0 RIP:     0x6ee17076fb80 EFL:            0x10246 CR2:                  0
 308         RAX:      0xc53d0acbcf0 RBX:     0x1e659ea7e0d0 RCX:                  0 RDX:     0x6ee1708300cc
 309         RSI:                  0 RDI:     0x6ee170830040 RBP:     0x3b13734898e0 RSP:     0x3b13734898d8
 310         R8:      0x3b1373489860 R9:          0x2776ff4f R10:     0x2749d3e9a940 R11:              0x246
 311         R12:     0x1e659ea7e0f0 R13: 0xd7231230fd6ff2e7 R14:     0x1e659ea7e108 R15:      0xc53d0acbcf0
 312       }}}
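
The key-value structure of a ``hexdict`` body might be read as in the sketch
below. It is written in Python purely for illustration; ``PAIR_RE`` and
``parse_hexdict`` are hypothetical names, and the sketch assumes, as in the
example above, that whitespace may follow the ``:`` before the value.

.. code-block:: python

   import re

   # Key, ":", optional whitespace, then 0x-prefixed hex digits or only zeros.
   PAIR_RE = re.compile(r"([^\s:]+):\s*(0x[0-9a-fA-F]+|0+)(?=\s|$)")


   def parse_hexdict(body):
       """Map each key in a hexdict body to its integer value."""
       return {key: int(value, 16) for key, value in PAIR_RE.findall(body)}

Here ``body`` is the text between ``{{{hexdict:`` and the closing ``}}}``.
Deciding which values look like code or data addresses would then rely on the
contextual elements described later in this document.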
 313
 314 Trigger elements
 315 ================
 316
 317 These elements cause an external action and are presented to the user in
 318 human-readable form. Generally, the external action they trigger produces a
 319 linkable page. The link, or other information about the external action, can
 320 then be presented to the user.
 321
 322 ``{{{dumpfile:%s:%s}}}`` [#not_yet_implemented]_
 323
 324   Here the first ``%s`` is an identifier for a type of dump and the second
 325   ``%s`` is an identifier for a particular dump that's just been published. The
 326   types of dumps, the exact meaning of "published", and the nature of the
 327   identifier are outside the scope of the markup format per se. In general it
 328   might correspond to writing a file by that name or something similar.
 329
 330   This element may trigger additional post-processing work beyond symbolizing
 331   the markup. It indicates that a dump file of some sort has been published.
 332   Some logic attached to the symbolizing filter may understand certain types of
 333   dump file and trigger additional post-processing of the dump file upon
 334   encountering this element (e.g. generating visualizations, symbolization). The
 335   expectation is that the information collected from contextual elements
 336   (described below) in the logging stream may be necessary to decode the content
 337   of the dump. So if the symbolizing filter triggers other processing, it may
 338   need to feed some distilled form of the contextual information to those
 339   processes.
 340
 341   An example of a type identifier is ``sancov``, for dumps from LLVM
 342   `SanitizerCoverage <https://clang.llvm.org/docs/SanitizerCoverage.html>`_.
 343
 344   Example::
 345
 346     {{{dumpfile:sancov:sancov.8675}}}
 347
 348 Contextual elements
 349 ===================
 350
 351 These are elements that supply information necessary to convert presentation
 352 elements to symbolic form. Unlike presentation elements, they are not directly
 353 related to the surrounding text. Contextual elements should appear alone on
 354 lines with no other non-whitespace text, so that the symbolizing filter might
 355 elide the whole line from its output without hiding any other log text.
 356
 357 The contextual elements themselves do not necessarily need to be presented in
 358 human-readable output. However, the information they impart may be essential to
 359 understanding the logging text even after symbolization. So it's recommended
 360 that this information be preserved in some form when the original raw log with
 361 markup may no longer be readily accessible for whatever reason.
 362
 363 Contextual elements should appear in the logging stream before they are needed.
 364 That is, if some piece of context may affect how the symbolizing filter would
 365 interpret or present a later presentation element, the necessary contextual
 366 elements should have appeared somewhere earlier in the logging stream. It should
 367 always be possible for the symbolizing filter to be implemented as a single pass
 368 over the raw logging stream, accumulating context and massaging text as it goes.
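
A single-pass filter of the kind described above might be structured like the
following Python sketch. The names ``CONTEXT_LINE_RE`` and ``single_pass`` are
hypothetical; the sketch only accumulates context and elides contextual lines,
and it leaves presentation elements untouched rather than symbolizing them.

.. code-block:: python

   import re

   # A line holding one contextual element and nothing but whitespace around it.
   CONTEXT_LINE_RE = re.compile(
       r"\s*\{\{\{(reset|module|mmap)((?::[^:}]*)*)\}\}\}\s*"
   )


   def single_pass(lines):
       """Yield (line, context so far) for every non-contextual line."""
       context = []  # (tag, fields) pairs seen since the last reset
       for line in lines:
           match = CONTEXT_LINE_RE.fullmatch(line)
           if match is None:
               yield line, tuple(context)
           elif match.group(1) == "reset":
               context.clear()
           else:
               context.append((match.group(1), match.group(2).split(":")[1:]))

Because context is only ever read from earlier lines, the generator never needs
to look ahead or revisit its input.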
 369
 370 ``{{{reset}}}``
 371
 372   This should be output before any other contextual element. It exists to
 373   support implementations that handle logs coming from multiple processes.
 374   Such implementations might not know when a new process
 375   starts or ends. Because some identifying information (like process IDs) might
 376   be the same between old and new processes, a way is needed to distinguish two
 377   processes with such identical identifying information. This element tells
 378   such implementations to reset the state of a filter so that information from a
 379   previous process's contextual elements is not assumed for a new process that
 380   just happens to have the same identifying information.
 381
 382 ``{{{module:%i:%s:%s:...}}}``
 383
 384   This element represents a so-called "module". A "module" is a single linked
 385   binary, such as a loaded ELF file. Usually each module occupies a contiguous
 386   range of memory.
 387
 388   Here ``%i`` is the module ID which is used by other contextual elements to
 389   refer to this module. The first ``%s`` is a human-readable identifier for the
 390   module, such as an ELF ``DT_SONAME`` string or a file name; but it might be
 391   empty. It's only for casual information. Only the module ID is used to refer
 392   to this module in other contextual elements, never the ``%s`` string. The
 393   ``module`` element defining a module ID must always be emitted before any
 394   other elements that refer to that module ID, so that a filter never needs to
 395   keep track of dangling references. The second ``%s`` is the module type and it
 396   determines what the remaining fields are. The following module types are
 397   supported:
 398
 399   * ``elf:%x``
 400
 401   Here ``%x`` encodes an ELF Build ID. The Build ID should refer to a single
 402   linked binary. The Build ID string is the sole way to identify the binary from
 403   which this module was loaded.
 404
 405   Example::
 406
 407     {{{module:1:libc.so:elf:83238ab56ba10497}}}
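
Recording a ``module`` element could look like the sketch below. It is a
minimal Python illustration, not the ``llvm-symbolizer`` implementation; the
names ``parse_int`` and ``parse_module`` and the dictionary layout are
hypothetical.

.. code-block:: python

   def parse_int(field):
       """Parse a %i field: 0x means hex, a leading 0 means octal, else decimal."""
       if field.startswith("0x"):
           return int(field[2:], 16)
       if len(field) > 1 and field.startswith("0"):
           return int(field, 8)
       return int(field, 10)


   def parse_module(fields, modules):
       """Record one module element in the modules dict, keyed by module ID."""
       module_id, name, module_type = parse_int(fields[0]), fields[1], fields[2]
       if module_type != "elf":
           raise ValueError(f"unsupported module type: {module_type!r}")
       # Only the first type-specific field is read; extra fields are ignored.
       modules[module_id] = {"name": name, "build_id": bytes.fromhex(fields[3])}
       return module_id

The ``name`` entry is kept only for display; later lookups use the module ID
and the Build ID.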
 408
 409 ``{{{mmap:%p:%i:...}}}``
 410
 411   This contextual element is used to give information about a particular region
 412   in memory. ``%p`` is the starting address and ``%i`` gives the size in hex of the
 413   region of memory. The ``...`` part can take different forms to give different
 414   information about the specified region of memory. The allowed forms are the
 415   following:
 416
 417   * ``load:%i:%s:%p``
 418
 419   This subelement informs the filter that a segment was loaded from a module.
 420   The module is identified by its module ID ``%i``. The ``%s`` is one or more of
 421   the letters 'r', 'w', and 'x' (in that order and in either upper or lower
 422   case) to indicate this segment of memory is readable, writable, and/or
 423   executable. The symbolizing filter can use this information to guess whether
 424   an address is a likely code address or a likely data address in the given
 425   module. The remaining ``%p`` gives the module relative address. For ELF files
 426   the module relative address will be the ``p_vaddr`` of the associated program
 427   header. For example if your module's executable segment has
 428   ``p_vaddr=0x1000``, ``p_memsz=0x1234``, and was loaded at ``0x7acba69d5000``
 429   then you need to subtract ``0x7acba69d4000`` from any address between
 430   ``0x7acba69d5000`` and ``0x7acba69d6234`` to get the module relative address.
 431   The starting address will usually have been rounded down to the active page
 432   size, and the size rounded up.
 433
 434   Example::
 435
 436     {{{mmap:0x7acba69d5000:0x5a000:load:1:rx:0x1000}}}
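
The address arithmetic described above can be checked with a short Python
sketch. The name ``module_relative`` and the tuple layout are hypothetical; the
numbers come from the ``mmap`` example above.

.. code-block:: python

   def module_relative(address, mmaps):
       """Return (module ID, module-relative address) for address, or None.

       Each entry of mmaps is (start, size, module_id, flags, vaddr), taken
       from the fields of one mmap element with a load subelement.
       """
       for start, size, module_id, flags, vaddr in mmaps:
           if start <= address < start + size:
               return module_id, address - start + vaddr
       return None


   # Fields of {{{mmap:0x7acba69d5000:0x5a000:load:1:rx:0x1000}}}
   mmaps = [(0x7ACBA69D5000, 0x5A000, 1, "rx", 0x1000)]
   result = module_relative(0x7ACBA69D5ABC, mmaps)

Here ``result`` is ``(1, 0x1abc)``, the same value obtained by subtracting
``0x7acba69d4000`` from the runtime address.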
 437
 438 .. rubric:: Footnotes
 439
 440 .. [#not_yet_implemented] This markup element is not yet implemented in
 441   :doc:`llvm-symbolizer <CommandGuide/llvm-symbolizer>`.