==========================
Symbolizer Markup Format
==========================

This document defines a text format for log messages that can be processed by a
symbolizing filter. The basic idea is that logging code emits text that contains
raw address values and so forth, without the logging code doing any real work to
convert those values to human-readable form. Instead, logging text uses the
markup format defined here to identify pieces of information that should be
converted to human-readable form after the fact. As with other markup formats,
the expectation is that most of the text will be displayed as is, while the
markup elements will be replaced with expanded text, or converted into active UI
elements, that present more details in symbolic form.

This means there is no need for symbol tables, DWARF debugging sections, or
similar information to be directly accessible at runtime. There is also no need
at runtime for any logic intended to compute human-readable presentation of
information, such as C++ symbol demangling. Instead, logging must include markup
elements that give the contextual information necessary to make sense of the raw
data, such as memory layout details.

This format identifies markup elements with a syntax that is both simple and
distinctive. It's simple enough to be matched and parsed with straightforward
code. It's distinctive enough that character sequences that look like the start
or end of a markup element should rarely if ever appear incidentally in logging
text. It's specifically intended not to require sanitizing plain text, such as
the HTML/XML requirement to replace ``<`` with ``&lt;`` and the like.

:manpage:`llvm-symbolizer(1)` includes a symbolizing filter via its ``--filter``

A symbolizing filter implementation will be independent both of the target
operating system and machine architecture where the logs are generated and of
the host operating system and machine architecture where the filter runs.

This format assumes that the symbolizing filter processes intact whole lines. If
long lines might be split during some stage of a logging pipeline, they must be
reassembled to restore the original line breaks before feeding lines into the
symbolizing filter. Most markup elements must appear entirely on a single line
(often with other text before and/or after the markup element). There are some
markup elements that are specified to span lines, with line breaks in the middle
of the element. Even in those cases, the filter is not expected to handle line
breaks in arbitrary places inside a markup element, but only inside certain

This format assumes that the symbolizing filter processes a coherent stream of
log lines from a single process address space context. If a logging stream
interleaves log lines from more than one process, these must be collated into
separate per-process log streams and each stream processed by a separate
instance of the symbolizing filter. Because the kernel and user processes use
disjoint address regions in most operating systems, a single user process
address space plus the kernel address space can be treated as a single address
space for symbolization purposes if desired.

Dependence on Build IDs
=======================

The symbolizer markup scheme relies on contextual information about runtime
memory address layout to make it possible to convert markup elements into useful
symbolic form. This relies on having an unmistakable identification of which
binary was loaded at each address.

An ELF Build ID is the payload of an ELF note with name ``"GNU"`` and type
``NT_GNU_BUILD_ID``, a unique byte sequence that identifies a particular binary
(executable, shared library, loadable module, or driver module). The linker
generates this automatically based on a hash that includes the complete symbol
table and debugging information, even if this is later stripped from the binary.

This specification uses the ELF Build ID as the sole means of identifying
binaries. Each binary relevant to the log must have been linked with a unique
Build ID. The symbolizing filter must have some means of mapping a Build ID back
to the original ELF binary (either the whole unstripped binary, or a stripped
binary paired with a separate debug file).

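
How a filter maps a Build ID to a file is left open by this format. As a minimal
sketch, the following assumes the ``.build-id/xx/yyyy.debug`` directory layout
used by GDB and by many distribution debug packages. The root directory and the
function name ``debug_file_for_build_id`` are hypothetical and are not defined
by this format.

.. code-block:: python

   from pathlib import PurePosixPath

   HEX_DIGITS = set("0123456789abcdef")


   def debug_file_for_build_id(build_id_hex, root="/usr/lib/debug"):
       """Return the conventional .build-id path for a hex Build ID string."""
       digits = build_id_hex.lower()
       # Need whole bytes, and at least two of them so both path parts are
       # non-empty.
       if len(digits) < 4 or len(digits) % 2 or not set(digits) <= HEX_DIGITS:
           raise ValueError(f"not a usable Build ID: {build_id_hex!r}")
       # Assumed layout: the first byte names the subdirectory, the remaining
       # bytes name the file.
       return (PurePosixPath(root) / ".build-id" / digits[:2]
               / (digits[2:] + ".debug"))

A filter could equally consult a symbol server or a build artifact index; the
only requirement stated above is that the Build ID alone selects the binary.
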
Colorization
============

The markup format supports a restricted subset of ANSI X3.64 SGR (Select Graphic
Rendition) control sequences. These are unlike other markup elements:

* They specify presentation details (bold or colors) rather than semantic
  information. The association of semantic meaning with color (e.g. red for
  errors) is chosen by the code doing the logging, rather than by the UI
  presentation of the symbolizing filter. This is a concession to existing code
  (e.g. LLVM sanitizer runtimes) that uses specific colors and would require
  substantial changes to generate semantic markup instead.

* A single control sequence changes "the state", rather than being a
  hierarchical structure that surrounds affected text.

The filter processes ANSI SGR control sequences only within a single line. If a
control sequence to enter a bold or color state is encountered, it's expected
that the control sequence to reset to default state will be encountered before
the end of that line. If a "dangling" state is left at the end of a line, the
filter may reset to default state for the next line.

An SGR control sequence is not interpreted inside any other markup element.
However, other markup elements may appear between SGR control sequences and the
color/bold state is expected to apply to the symbolic output that replaces the
markup element in the filter's output.

The accepted SGR control sequences all have the form ``"\033[%um"`` (expressed here
using C string syntax), where ``%u`` is one of these:

==== ============================ ===============================================
%u   Meaning                      Notes
==== ============================ ===============================================
0    Reset to default formatting.
1    Bold text                    Combines with color states, doesn't reset them.
35   Magenta foreground
==== ============================ ===============================================

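
The per-line state handling described above can be sketched as follows. This is
an illustration only: ``split_sgr`` and ``ACCEPTED_CODES`` are hypothetical
names, the accepted set contains only the codes visible in the table above, and
the sketch does not implement the rule that SGR sequences inside another markup
element are left uninterpreted.

.. code-block:: python

   import re

   SGR_RE = re.compile(r"\033\[(\d+)m")
   ACCEPTED_CODES = {0, 1, 35}  # assumption: extend to match the full table


   def split_sgr(line, accepted=ACCEPTED_CODES):
       """Split one line into (bold, color, text) runs.

       State starts at the default for every call, so a "dangling" bold or
       color state never leaks into the next line.
       """
       bold, color = False, None
       runs, pos = [], 0
       for m in SGR_RE.finditer(line):
           code = int(m.group(1))
           if code not in accepted:
               continue  # not an accepted sequence: keep it as plain text
           if m.start() > pos:
               runs.append((bold, color, line[pos:m.start()]))
           pos = m.end()
           if code == 0:
               bold, color = False, None
           elif code == 1:
               bold = True  # combines with the current color
           else:
               color = code
       if pos < len(line):
           runs.append((bold, color, line[pos:]))
       return runs

For example, ``split_sgr("\033[1m\033[35mERROR\033[0m: boom")`` yields one bold
magenta run for ``ERROR`` followed by a default run for ``: boom``.
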
Common markup element syntax
============================

All the markup elements share a common syntactic structure to facilitate simple
matching and parsing code. Each element has the form::

  {{{tag:fields}}}

``tag`` identifies one of the element types described below, and is always a
short alphabetic string that must be in lower case. The rest of the element
consists of one or more fields. Fields are separated by ``:`` and cannot contain
any ``:`` or ``}`` characters. How many fields must be or may be present and
what they contain is specified for each element type.

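
Because the syntax is this regular, a single regular expression is enough to
locate elements within a line. The sketch below is one possible matcher, not a
reference implementation; ``ELEMENT_RE`` and ``find_elements`` are hypothetical
names. It works on one line at a time, so elements that span lines would need
extra handling, and it also accepts a bare tag with no fields.

.. code-block:: python

   import re

   # tag: lower-case alphabetic string; each field is introduced by ':' and
   # may not contain ':' or '}'.
   ELEMENT_RE = re.compile(r"\{\{\{([a-z]+)((?::[^:}]*)*)\}\}\}")


   def find_elements(line):
       """Yield (tag, fields) for each markup element found in one line."""
       for m in ELEMENT_RE.finditer(line):
           raw_fields = m.group(2)
           # Drop the leading ':' before splitting; empty fields are kept.
           fields = raw_fields[1:].split(":") if raw_fields else []
           yield m.group(1), fields

Text outside the matches is ordinary log text and would be passed through
unchanged.
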
No markup elements or ANSI SGR control sequences are interpreted inside the
contents of a markup element.

In the descriptions of each element type, ``printf``-style placeholders indicate
field contents:

``%s``
  A string of printable characters, not including ``:`` or ``}``.

``%p``
  An address value represented by ``0x`` followed by an even number of
  hexadecimal digits (using either lower-case or upper-case for ``A``–``F``).
  If the digits are all ``0`` then the ``0x`` prefix may be omitted. No more
  than 16 hexadecimal digits are expected to appear in a single value (64 bits).

``%u``
  A nonnegative decimal integer.

``%i``
  A nonnegative integer. The digits are hexadecimal if prefixed by ``0x``, octal
  if prefixed by ``0``, or decimal otherwise.

``%x``
  A sequence of an even number of hexadecimal digits (using either lower-case or
  upper-case for ``A``–``F``), with no ``0x`` prefix. This represents an
  arbitrary sequence of bytes, such as an ELF Build ID.

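
The numeric field kinds described above can be parsed with a few small helpers.
The function names below are hypothetical, and the address parser is
deliberately lenient: it checks the 16-digit limit but does not enforce the
even digit count.

.. code-block:: python

   import re


   def parse_address(text):
       """Address field: '0x' plus hex digits, or all-zero digits."""
       if re.fullmatch(r"0+", text):
           return 0
       # Lenient sketch: 1 to 16 digits, even digit count not enforced.
       if not re.fullmatch(r"0x[0-9a-fA-F]{1,16}", text):
           raise ValueError(f"bad address field: {text!r}")
       return int(text[2:], 16)


   def parse_decimal(text):
       """Nonnegative decimal integer field."""
       if not re.fullmatch(r"[0-9]+", text):
           raise ValueError(f"bad decimal field: {text!r}")
       return int(text, 10)


   def parse_integer(text):
       """Nonnegative integer field with C-style base prefixes."""
       if re.fullmatch(r"0x[0-9a-fA-F]+", text):
           return int(text[2:], 16)
       if re.fullmatch(r"0[0-7]+", text):
           return int(text[1:], 8)
       if re.fullmatch(r"0|[1-9][0-9]*", text):
           return int(text, 10)
       raise ValueError(f"bad integer field: {text!r}")


   def parse_bytes(text):
       """Byte-sequence field: an even number of hex digits, no prefix."""
       if not re.fullmatch(r"(?:[0-9a-fA-F]{2})+", text):
           raise ValueError(f"bad byte-sequence field: {text!r}")
       return bytes.fromhex(text)

Python's ``int(text, 0)`` is avoided for the prefixed integer form because it
rejects C-style octal such as ``017``.
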
Presentation elements
=====================

These are elements that convey a specific program entity to be displayed in
human-readable symbolic form.

``{{{symbol:%s}}}``

  Here ``%s`` is the linkage name for a symbol or type. It may require
  demangling according to language ABI rules. Even for unmangled names, it's
  recommended that this markup element be used to identify a symbol name so that
  it can be presented distinctively.

  Example::

    {{{symbol:_ZN7Mangled4NameEv}}}

``{{{pc:%p}}}``, ``{{{pc:%p:ra}}}``, ``{{{pc:%p:pc}}}``

  Here ``%p`` is the memory address of a code location. It might be presented as a
  function name and source location. The latter two forms distinguish the kind of
  code location, as described in detail for ``bt`` elements below.

  Example::

    {{{pc:0xffffffff9abcdef0}}}

``{{{data:%p}}}``

  Here ``%p`` is the memory address of a data location. It might be presented as
  the name of a global variable at that location.

  Examples::

    {{{data:0x12345678}}}
    {{{data:0xffffffff9abcdef0}}}

``{{{bt:%u:%p}}}``, ``{{{bt:%u:%p:ra}}}``, ``{{{bt:%u:%p:pc}}}`` [#not_yet_implemented]_

  This represents one frame in a backtrace. It usually appears on a line by
  itself (surrounded only by whitespace), in a sequence of such lines with
  ascending frame numbers. So the human-readable output might be formatted
  assuming that, such that it looks good for a sequence of bt elements each
  alone on its line with uniform indentation of each line. But it can appear
  anywhere, so the filter should not remove any non-whitespace text surrounding
  the element.

  Here ``%u`` is the frame number, which starts at zero for the location of the
  fault being identified, increments to one for the caller of frame zero's call
  frame, to two for the caller of frame one, etc. ``%p`` is the memory address
  of a code location.

  Code locations in a backtrace come from two distinct sources. Most backtrace
  frames describe a return address code location, i.e. the instruction
  immediately after a call instruction. This is the location of code that has
  yet to run, since the function called there has not yet returned. Hence the
  code location of actual interest is usually the call site itself rather than
  the return address, i.e. one instruction earlier. When presenting the source
  location for a return address frame, the symbolizing filter will subtract one
  byte or one instruction length from the actual return address for the call
  site, with the intent that the address logged can be translated directly to a
  source location for the call site and not for the apparent return site
  thereafter (which can be confusing). When inlined functions are involved, the
  call site and the return site can appear to be in different functions at
  entirely unrelated source locations rather than just a line away, making the
  confusion of showing the return site rather than the call site quite severe.

  Often the first frame in a backtrace ("frame zero") identifies the precise
  code location of a fault, trap, or asynchronous interrupt rather than a return
  address. At other times, even the first frame is actually a return address
  (for example, backtraces collected at the time of an object allocation and
  reported later when the allocated object is used or misused). When a system
  supports in-thread trap handling, there may also be frames after the first
  that represent a precise interrupted code location rather than a return
  address, presented as the "caller" of a trap handler function (for example,
  signal handlers in POSIX systems).

  Return address frames are identified by the ``:ra`` suffix. Precise code
  location frames are identified by the ``:pc`` suffix.

  Traditional practice has often been to collect backtraces as simple address
  lists, losing the distinction between return address code locations and
  precise code locations. Some such code applies the "subtract one" adjustment
  described above to the address values before reporting them, and it's not
  always clear or consistent whether this adjustment has been applied or not.
  These ambiguous cases are supported by the ``bt`` and ``pc`` forms with no
  ``:ra`` or ``:pc`` suffix, which indicate it's unclear which sort of code
  location this is. However, it's highly recommended that all emitters use the
  suffixed forms and deliver address values with no adjustments applied. When
  traditional practice has been ambiguous, the majority of cases seem to have
  been of printing addresses that are return address code locations and printing
  them without adjustment. So the symbolizing filter will usually apply the
  "subtract one byte" adjustment to an address printed without a disambiguating
  suffix. Assuming that a call instruction is longer than one byte on all
  supported machines, applying the "subtract one byte" adjustment a second time
  still results in an address somewhere in the call instruction, so a little
  sloppiness here often does little or no harm.

  Examples::

    {{{bt:0:0x12345678:pc}}}
    {{{bt:1:0xffffffff9abcdef0:ra}}}

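
The adjustment rules for ``bt`` and ``pc`` elements reduce to a small decision
on the optional suffix. The following is a sketch under the assumption that a
one-byte adjustment is used; ``lookup_address`` is a hypothetical helper name.

.. code-block:: python

   def lookup_address(address, kind=None):
       """Return the address to use for the source-location lookup.

       kind is "ra", "pc", or None when the element has no suffix.
       """
       if kind == "pc":
           # Precise code location: use the address exactly as logged.
           return address
       if kind == "ra" or kind is None:
           # Return address, or ambiguous and treated as one: step back one
           # byte so the lookup lands inside the call instruction.
           return address - 1 if address > 0 else address
       raise ValueError(f"unknown code location kind: {kind!r}")

The adjusted value is only meant for the symbol and line lookup; a filter would
normally still display the address as it was logged.
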
``{{{hexdict:...}}}`` [#not_yet_implemented]_

  This element can span multiple lines. Here ``...`` is a sequence of key-value
  pairs where a single ``:`` separates each key from its value, and arbitrary
  whitespace separates the pairs. The value (right-hand side) of each pair
  either is one or more ``0`` digits, or is ``0x`` followed by hexadecimal
  digits. Each value might be a memory address or might be some other integer
  (including an integer that looks like a likely memory address but actually has
  an unrelated purpose). When the contextual information about the memory layout
  suggests that a given value could be a code location or a global variable data
  address, it might be presented as a source location or variable name or with
  active UI that makes such interpretation optionally visible.

  The intended use is for things like register dumps, where the emitter doesn't
  know which values might have a symbolic interpretation but a presentation that
  makes plausible symbolic interpretations available might be very useful to
  someone reading the log. At the same time, a flat text presentation should
  usually avoid interfering too much with the original contents and formatting
  of the dump. For example, it might use footnotes with source locations for
  values that appear to be code locations. An active UI presentation might show
  the dump text as is, but highlight values with symbolic information available
  and pop up a presentation of symbolic details when a value is selected.

  Example::

    {{{hexdict:
      CS: 0 RIP: 0x6ee17076fb80 EFL: 0x10246 CR2: 0
      RAX: 0xc53d0acbcf0 RBX: 0x1e659ea7e0d0 RCX: 0 RDX: 0x6ee1708300cc
      RSI: 0 RDI: 0x6ee170830040 RBP: 0x3b13734898e0 RSP: 0x3b13734898d8
      R8: 0x3b1373489860 R9: 0x2776ff4f R10: 0x2749d3e9a940 R11: 0x246
      R12: 0x1e659ea7e0f0 R13: 0xd7231230fd6ff2e7 R14: 0x1e659ea7e108 R15: 0xc53d0acbcf0
    }}}

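
The body of a ``hexdict`` element, once the surrounding delimiters have been
removed, can be read with one regular expression. This is a sketch with
hypothetical names (``PAIR_RE``, ``parse_hexdict``). It tolerates whitespace
between the ``:`` and the value, as in the register dump shown above, which is
an assumption about how emitters align columns.

.. code-block:: python

   import re

   # key, ':', optional padding, then either 0x<hex digits> or a run of zeros.
   PAIR_RE = re.compile(r"(\S+?):\s*(0x[0-9a-fA-F]+|0+)(?=\s|$)")


   def parse_hexdict(body):
       """Return {key: integer value} for the pairs in a hexdict body."""
       # int(..., 16) accepts both the "0x" form and a bare run of zeros.
       return {key: int(value, 16) for key, value in PAIR_RE.findall(body)}

Each resulting value could then be checked against the known memory layout to
decide whether a symbolic interpretation is plausible.
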
Trigger elements
================

These elements cause an external action and will be presented to the user in a
human-readable form. Generally they trigger an external action to occur that
results in a linkable page. The link or other useful information about
the external action can then be presented to the user.

``{{{dumpfile:%s:%s}}}`` [#not_yet_implemented]_

  Here the first ``%s`` is an identifier for a type of dump and the second
  ``%s`` is an identifier for a particular dump that's just been published. The
  types of dumps, the exact meaning of "published", and the nature of the
  identifier are outside the scope of the markup format per se. In general it
  might correspond to writing a file by that name or something similar.

  This element may trigger additional post-processing work beyond symbolizing
  the markup. It indicates that a dump file of some sort has been published.
  Some logic attached to the symbolizing filter may understand certain types of
  dump file and trigger additional post-processing of the dump file upon
  encountering this element (e.g. generating visualizations, symbolization). The
  expectation is that the information collected from contextual elements
  (described below) in the logging stream may be necessary to decode the content
  of the dump. So if the symbolizing filter triggers other processing, it may
  need to feed some distilled form of the contextual information to those

  An example of a type identifier is ``sancov``, for dumps from LLVM
  `SanitizerCoverage <https://clang.llvm.org/docs/SanitizerCoverage.html>`_.

  Example::

    {{{dumpfile:sancov:sancov.8675}}}

Contextual elements
===================

These are elements that supply information necessary to convert presentation
elements to symbolic form. Unlike presentation elements, they are not directly
related to the surrounding text. Contextual elements should appear alone on
lines with no other non-whitespace text, so that the symbolizing filter might
elide the whole line from its output without hiding any other log text.

The contextual elements themselves do not necessarily need to be presented in
human-readable output. However, the information they impart may be essential to
understanding the logging text even after symbolization. So it's recommended
that this information be preserved in some form when the original raw log with
markup may no longer be readily accessible for whatever reason.

Contextual elements should appear in the logging stream before they are needed.
That is, if some piece of context may affect how the symbolizing filter would
interpret or present a later presentation element, the necessary contextual
elements should have appeared somewhere earlier in the logging stream. It should
always be possible for the symbolizing filter to be implemented as a single pass
over the raw logging stream, accumulating context and massaging text as it goes.

``{{{reset}}}``

  This should be output before any other contextual element. This contextual
  element exists to support implementations that handle logs coming from
  multiple processes. Such implementations might not know when a new process
  starts or ends. Because some identifying information (like process IDs) might
  be the same between old and new processes, a way is needed to distinguish two
  processes with such identical identifying information. This element informs
  such implementations to reset the state of a filter so that information from a
  previous process's contextual elements is not assumed for a new process that
  just happens to have the same identifying information.

``{{{module:%i:%s:%s:...}}}``

  This element represents a so-called "module". A "module" is a single linked
  binary, such as a loaded ELF file. Usually each module occupies a contiguous

  Here ``%i`` is the module ID which is used by other contextual elements to
  refer to this module. The first ``%s`` is a human-readable identifier for the
  module, such as an ELF ``DT_SONAME`` string or a file name; but it might be
  empty. It's only for casual information. Only the module ID is used to refer
  to this module in other contextual elements, never the ``%s`` string. The
  ``module`` element defining a module ID must always be emitted before any
  other elements that refer to that module ID, so that a filter never needs to
  keep track of dangling references. The second ``%s`` is the module type and it
  determines what the remaining fields are. The following module types are
  supported:

  * ``elf:%x``

    Here ``%x`` encodes an ELF Build ID. The Build ID should refer to a single
    linked binary. The Build ID string is the sole way to identify the binary from
    which this module was loaded.

  Example::

    {{{module:1:libc.so:elf:83238ab56ba10497}}}

``{{{mmap:%p:%i:...}}}``

  This contextual element is used to give information about a particular region
  in memory. ``%p`` is the starting address and ``%i`` gives the size in hex of the
  region of memory. The ``...`` part can take different forms to give different
  information about the specified region of memory. The allowed forms are the
  following:

  * ``load:%i:%s:%p``

    This subelement informs the filter that a segment was loaded from a module.
    The module is identified by its module ID ``%i``. The ``%s`` is one or more of
    the letters 'r', 'w', and 'x' (in that order and in either upper or lower
    case) to indicate this segment of memory is readable, writable, and/or
    executable. The symbolizing filter can use this information to guess whether
    an address is a likely code address or a likely data address in the given
    module. The remaining ``%p`` gives the module relative address. For ELF files
    the module relative address will be the ``p_vaddr`` of the associated program
    header. For example if your module's executable segment has
    ``p_vaddr=0x1000``, ``p_memsz=0x1234``, and was loaded at ``0x7acba69d5000``
    then you need to subtract ``0x7acba69d4000`` from any address between
    ``0x7acba69d5000`` and ``0x7acba69d6234`` to get the module relative address.
    The starting address will usually have been rounded down to the active page
    size, and the size rounded up.

  Example::

    {{{mmap:0x7acba69d5000:0x5a000:load:1:rx:0x1000}}}

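
The address arithmetic for a ``load`` region can be sketched as follows. The
names ``LoadedSegment`` and ``parse_mmap_load`` are hypothetical, the input is
assumed to be the list of fields already split on ``:``, and the size and
module ID are parsed with ``int(text, 0)``, which does not accept C-style
octal.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass
   class LoadedSegment:
       start: int         # runtime starting address of the region
       size: int          # size of the region in bytes
       module_id: int     # ID from an earlier module element
       flags: str         # some of "r", "w", "x", lower-cased
       module_vaddr: int  # module relative address of the region start

       def contains(self, address):
           return self.start <= address < self.start + self.size

       def to_module_relative(self, address):
           # Equivalent to subtracting (start - module_vaddr) from the address.
           return address - self.start + self.module_vaddr

       def is_code(self):
           return "x" in self.flags


   def parse_mmap_load(fields):
       """Build a LoadedSegment from the fields of an mmap load element."""
       start, size, form, module_id, flags, vaddr = fields
       if form != "load":
           raise ValueError(f"unsupported mmap form: {form!r}")
       return LoadedSegment(int(start, 16), int(size, 0), int(module_id, 0),
                            flags.lower(), int(vaddr, 16))

With the example element above, the runtime address ``0x7acba69d5000`` maps to
the module relative address ``0x1000``, matching the subtraction of
``0x7acba69d4000`` described in the text.
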
.. rubric:: Footnotes

.. [#not_yet_implemented] This markup element is not yet implemented in
  :manpage:`llvm-symbolizer(1)`.