lldb/docs/resources/dataformatters.rst

   1 Data Formatters
   2 ===============
   3
   4 This page is an introduction to the design of the LLDB data formatters
   5 subsystem. The intended audience is people interested in understanding
   6 or modifying the formatters themselves rather than writing a specific data
   7 formatter. For the latter, refer to :doc:`/use/variable/`.
   8
   9 This page also highlights some open areas for improvement to the general
  10 subsystem, and more evolutions not anticipated here are certainly possible.
  11
  12 Overview
  13 --------
  14
  15 The LLDB data formatters subsystem is used to allow the debugger as well as the
  16 end-users to customize the way their variables look upon inspection in the user
  17 interface (be it the command line tool, or one of the several GUIs that are
  18 backed by LLDB).
  19
  20 To this end, the formatters are hooked into the ``ValueObject`` model, in order to
  21 provide entry points through which such customization questions can be
  22 answered. For example: What format should this number be printed as? How many
  23 child elements does this ``std::vector`` have?
  24
  25 The architecture of the subsystem is layered, with the highest level layer
  26 being the user visible interaction features (e.g. the ``type ***`` commands,
  27 the SB classes, ...). Other layers of interest that will be analyzed in this
  28 document include:
  29
  30 * Classes implementing individual data formatter types
  31 * Classes implementing formatters navigation, discovery and categorization
  32 * The ``FormatManager`` layer
  33 * The ``DataVisualization`` layer
  34 * The SWIG <> LLDB communication layer
  35
  36 Data Formatter Types
  37 --------------------
  38
  39 As described in the user documentation, there are four types of formatters:
  40
  41 * Formats
  42 * Summaries
  43 * Filters
  44 * Synthetic children
  45
  46 Formatters have descriptor classes, ``Type*Impl``, which contain at least a
  47 "Flags" nested object. It holds both rules to be used by the matching
  48 algorithm (e.g. should the formatter for type Foo apply to a Foo*?) and rules to
  49 be used by the formatter itself (e.g. is this summary a one-liner?).
  50
  51 Individual formatter descriptor classes then also contain data items useful to
  52 them for performing their functionality. For instance ``TypeFormatImpl``
  53 (backing formats) contains an ``lldb::Format`` that is the format to then be
  54 applied were this formatter to be selected. Upon issuing a ``type format add``
  55 a new ``TypeFormatImpl`` is created that wraps the user-specified format, and
  56 matching options:
  57
  58 ::
  59
  60   entry.reset(new TypeFormatImpl(
  61       format, TypeFormatImpl::Flags()
  62                   .SetCascades(m_command_options.m_cascade)
  63                   .SetSkipPointers(m_command_options.m_skip_pointers)
  64                   .SetSkipReferences(m_command_options.m_skip_references)));
  65
  66
  67 While formats are fairly simple and only implemented by one class, the other
  68 formatter types are backed by a class hierarchy.
  69
  70 Summaries, for instance, can exist in one of three "flavors":
  71
  72 * Summary strings
  73 * Python script
  74 * Native C++
  75
  76 The base class for summaries, ``TypeSummaryImpl``, is a pure virtual class that
  77 wraps, again, the Flags, and exports among others:
  78
  79 ::
  80
  81   virtual bool FormatObject (ValueObject *valobj, std::string& dest) = 0;
  82
  83
  84 This is the core entry point, which allows subclasses to specify their mode of
  85 operation.
  86
  87 ``StringSummaryFormat``, which is the class that implements summary strings,
  88 does a check as to whether the summary is a one-liner, and if not, then uses
  89 its stored summary string to call into ``Debugger::FormatPrompt``, and obtain a
  90 string back, which it returns in ``dest`` as the resulting summary.
  91
  92 For a Python summary, implemented in ``ScriptSummaryFormat``,
  93 ``FormatObject()`` calls into the ``ScriptInterpreter`` which is supposed to
  94 hold the knowledge on how to bridge back and forth with the scripting language
  95 (Python in the case of LLDB) in order to produce a valid string. Implementors
  96 of new ``ScriptInterpreters`` for other languages are expected to provide a
  97 ``GetScriptedSummary()`` entry point for this purpose, if they desire to allow
  98 users to provide formatters in the new language.
  99
 100 Lastly, C++ summaries (``CXXFunctionSummaryFormat``) wrap a function pointer
 101 and call into it to execute their duty. It should be noted that there are no
 102 facilities for users to interact with C++ formatters, and as such they are
 103 extremely opaque, effectively being a thin wrapper between plain function
 104 pointers and the LLDB formatters subsystem.
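
The three summary flavors can be pictured with a small Python model. This is only an illustration of the class layout described above, not LLDB code. The class names are hypothetical, a ``dict`` plays the role of a ``ValueObject``, and ``str.format`` stands in for the real summary-string syntax.

```python
from abc import ABC, abstractmethod


class TypeSummary(ABC):
    """Toy stand-in for TypeSummaryImpl: shared flags plus one entry point."""

    def __init__(self, one_liner=False):
        self.one_liner = one_liner  # stands in for the Flags object

    @abstractmethod
    def format_object(self, valobj):
        """Return the summary text for valobj."""


class StringSummary(TypeSummary):
    """Summary-string flavor: expands a stored template."""

    def __init__(self, template, **flags):
        super().__init__(**flags)
        self.template = template

    def format_object(self, valobj):
        # Assumption: str.format replaces the real summary-string expansion.
        return self.template.format(**valobj)


class CallbackSummary(TypeSummary):
    """Script and native flavors both reduce to calling a stored callable."""

    def __init__(self, callback, **flags):
        super().__init__(**flags)
        self.callback = callback

    def format_object(self, valobj):
        return self.callback(valobj)


point = {"x": 1, "y": 2}  # a dict plays the role of a ValueObject
summaries = [
    StringSummary("({x}, {y})"),
    CallbackSummary(lambda v: "sum=%d" % (v["x"] + v["y"])),
]
results = [summary.format_object(point) for summary in summaries]
```

Callers only see ``format_object``, so a new flavor can be added without touching the code that asks for summaries.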
 105
 106 Also, dynamic loading of C++ formatters in LLDB is currently not implemented,
 107 and as such it is safe and reasonable for these formatters to deal with
 108 internal ``ValueObject`` instances instead of public ``SBValue`` objects.
 109
 110 An interesting data point is that summaries are expected to be stateless. While
 111 at the Python layer they are handed an ``SBValue`` (since nothing else could be
 112 visible for scripts), it is not expected that the ``SBValue`` should be cached
 113 and reused - any and all caching occurs on the LLDB side, completely
 114 transparent to the formatter itself.
 115
 116 The design of synthetic children is somewhat more intricate, due to them being
 117 stateful objects. The core idea of the design is that synthetic children act
 118 like a two-tier model, in which there is a backend dataset (the underlying
 119 unformatted ``ValueObject``), and a higher-level view (frontend) which vends
 120 the computed representation.
 121
 122 To implement a new type of synthetic children one would implement a subclass of
 123 ``SyntheticChildren``, which, akin to ``TypeFormatImpl``, contains Flags for
 124 matching, and data items to be used for formatting. For instance,
 125 ``TypeFilterImpl`` (which implements filters), stores the list of expression
 126 paths of the children to be displayed.
 127
 128 Filters are themselves synthetic children. Since all they do is provide child
 129 values for a ``ValueObject``, it does not truly matter whether these come from the
 130 real set of children or are crafted through some intricate algorithm. As such,
 131 they perfectly fit within the realm of synthetic children and are only shown as
 132 separate entities for user friendliness (to a user, picking a subset of
 133 elements to be shown with relative ease is a valuable task, and they should not
 134 be concerned with writing scripts to do so).
 135
 136 Once the descriptor of the synthetic children has been coded, in order to hook
 137 it up, one has to implement a subclass of ``SyntheticChildrenFrontEnd``. For a
 138 given type of synthetic children, there is a deep coupling with the matching
 139 front-end class, given that the front-end usually needs data stored in the
 140 descriptor (e.g. a filter needs the list of child elements).
 141
 142 The front-end answers the interesting questions that are the true raison d'être
 143 of synthetic children:
 144
 145 ::
 146
 147   virtual size_t CalculateNumChildren () = 0;
 148   virtual lldb::ValueObjectSP GetChildAtIndex (size_t idx) = 0;
 149   virtual size_t GetIndexOfChildWithName (const ConstString &name) = 0;
 150   virtual bool Update () = 0;
 151   virtual bool MightHaveChildren () = 0;
 152
 153 Synthetic children providers (their front-ends) will be queried by LLDB for a
 154 number of children, and then for each of them as necessary, they should be
 155 prepared to return a ``ValueObject`` describing the child. They might also be
 156 asked to provide a name-to-index mapping (e.g. to allow LLDB to resolve queries
 157 like ``myFoo.myChild``).
 158
 159 ``Update()`` and ``MightHaveChildren()`` are described in the user
 160 documentation, and they mostly serve bookkeeping purposes.
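
As a rough sketch of the front-end contract, the toy class below answers the same five questions for a backend that is just a Python ``dict``. The method names are transliterations of the C++ ones, and the backend layout (``items`` and ``size`` keys) is made up for the example. The meaning of the value returned by ``update()`` is not modelled.

```python
class ListFrontEnd:
    """Toy front-end that presents a backend dict as indexed children."""

    def __init__(self, backend):
        self.backend = backend  # the "unformatted" object
        self.children = []
        self.update()

    def update(self):
        # Re-read state from the backend; the front-end is stateful.
        items = self.backend["items"]
        self.children = list(items[: self.backend["size"]])
        return False  # placeholder: return semantics are not modelled here

    def might_have_children(self):
        return True

    def calculate_num_children(self):
        return len(self.children)

    def get_child_at_index(self, idx):
        if 0 <= idx < len(self.children):
            return self.children[idx]
        return None

    def get_index_of_child_with_name(self, name):
        # Children are named "[0]", "[1]", ... in this sketch.
        try:
            idx = int(name.lstrip("[").rstrip("]"))
        except ValueError:
            return None
        return idx if 0 <= idx < len(self.children) else None


backend = {"items": [10, 20, 30, 99], "size": 3}
front_end = ListFrontEnd(backend)
count_before = front_end.calculate_num_children()

backend["size"] = 4  # the backend changes underneath the front-end
front_end.update()
count_after = front_end.calculate_num_children()
```

The child count only changes after ``update()`` runs, which is the statefulness that sets synthetic children apart from summaries.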
 161
 162 LLDB provides three kinds of synthetic children: filters, scripted synthetics,
 163 and native C++ providers. Filters are implemented by
 164 ``TypeFilterImpl::FrontEnd``.
 165
 166 Scripted synthetics are implemented by ``ScriptedSyntheticChildren::FrontEnd``,
 167 plus a set of callbacks provided by the ``ScriptInterpreter`` infrastructure to
 168 allow LLDB to pass the front-end queries down to the scripting languages.
 169
 170 As for C++ native synthetics, there is a ``CXXSyntheticChildren``, but no
 171 corresponding ``FrontEnd`` class. The reason for this design is that
 172 ``CXXSyntheticChildren`` store a callback to a creator function, which is
 173 responsible for providing a ``FrontEnd``. Each individual formatter (e.g.
 174 ``LibstdcppMapIteratorSyntheticFrontEnd``) is a standalone frontend, and once
 175 created retains no relation to its underlying ``SyntheticChildren`` object.
 176
 177 On a ``ValueObject`` level, upon being asked to generate synthetic children for
 178 a ``ValueObject``, LLDB spawns a ``ValueObjectSynthetic`` object which is a
 179 subclass of ``ValueObject``. Building upon the ``ValueObject`` infrastructure,
 180 it stores a backend, and a shared pointer to the ``SyntheticChildren``. Upon
 181 being asked queries about children, it will use the ``SyntheticChildren`` to
 182 generate a front-end for itself and will let the front-end answer questions.
 183 The reason for not storing the ``FrontEnd`` itself is that there is no
 184 guarantee that across updates, the same ``FrontEnd`` will be used over and over
 185 (e.g. a ``SyntheticChildren`` object could serve an entire class hierarchy and
 186 vend different frontends for different subclasses).
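
The relationship between the descriptor, its creator callback and the synthetic view might be sketched as follows. All names are hypothetical, and each front-end is reduced to a function that returns a child count. The point is only that the view keeps the descriptor and asks it for a front-end again on every update.

```python
class SyntheticDescriptor:
    """Toy SyntheticChildren: holds a creator callback, not a front-end."""

    def __init__(self, creator):
        self.creator = creator

    def get_front_end(self, backend):
        return self.creator(backend)


class SyntheticView:
    """Toy ValueObjectSynthetic: keeps backend and descriptor."""

    def __init__(self, backend, descriptor):
        self.backend = backend
        self.descriptor = descriptor
        self.front_end = None

    def update(self):
        # Ask again each time: the descriptor may pick a different front-end.
        self.front_end = self.descriptor.get_front_end(self.backend)

    def num_children(self):
        if self.front_end is None:
            self.update()
        return self.front_end(self.backend)


def pair_front_end(backend):
    return 2


def list_front_end(backend):
    return len(backend["items"])


def creator(backend):
    # Hypothetical rule: choose a front-end from the backend's kind.
    return pair_front_end if backend["kind"] == "pair" else list_front_end


view = SyntheticView({"kind": "list", "items": [1, 2, 3]},
                     SyntheticDescriptor(creator))
as_list = view.num_children()

view.backend["kind"] = "pair"
view.update()
as_pair = view.num_children()
```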
 187
 188 Formatters Matching
 189 -------------------
 190
 191 The problem of formatters matching is going from "I have a ``ValueObject``" to
 192 "these are the formatters to be used for it."
 193
 194 There is a rather intricate set of user rules that are involved, and a rather
 195 intricate implementation of this model. All of these relate to the type of the
 196 ``ValueObject``. It is assumed that types are a strong enough contract that it
 197 is possible to format an object entirely depending on its type. If this turns
 198 out to not be correct, then the existing model will have to be changed fairly
 199 deeply.
 200
 201 The basic building block is that formatters can match by exact type name or by
 202 regular expressions, i.e. one can describe matching by saying things like "this
 203 formatter matches type ``__NSDictionaryI``", or "this formatter matches all
 204 type names like ``^std::__1::vector<.+>(( )?&)?$``."
 205
 206 This match happens in class ``FormattersContainer``. For exact matches, this
 207 goes straight to the ``FormatMap`` (the actual storage area for formatters),
 208 whereas for regular expression matches the regular expression is matched
 209 against the provided candidate type name. If one were to introduce a new type
 210 of matching (say, match against the number of ``$`` signs present in the typename),
 211 ``FormattersContainer`` is the place where such a change would have to be
 212 introduced.
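
The two kinds of textual match can be illustrated with a toy container. This is a simplified model rather than the real ``FormattersContainer``: the class and method names are invented, and trying exact names before regular expressions (in insertion order) is an assumption of the sketch.

```python
import re


class Container:
    """Toy container: exact names in a dict, regexes in a list."""

    def __init__(self):
        self.exact = {}
        self.regexes = []

    def add(self, pattern, formatter, is_regex=False):
        if is_regex:
            self.regexes.append((re.compile(pattern), formatter))
        else:
            self.exact[pattern] = formatter

    def get(self, type_name):
        # Exact match: a direct lookup in the storage map.
        if type_name in self.exact:
            return self.exact[type_name]
        # Regex match: test each pattern against the candidate type name.
        for regex, formatter in self.regexes:
            if regex.search(type_name):
                return formatter
        return None


container = Container()
container.add("__NSDictionaryI", "dictionary summary")
container.add(r"^std::__1::vector<.+>(( )?&)?$", "vector synthetic",
              is_regex=True)
```

A new matching strategy would add a third storage area and a third branch in ``get()``, which is why the container is the natural place for such a change.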
 213
 214 It should be noted that this code involves template specialization, and as such
 215 is somewhat trickier than other formatters code to update.
 216
 217 On top of the string matching mechanism (exact or regex), there is a set of
 218 more advanced rules implemented by the ``FormattersContainer``, with the aid of the
 219 ``FormattersMatchCandidate``. Namely, it is assumed that any formatter class will
 220 have flags to say whether it allows cascading (i.e. seeing through typedefs),
 221 and whether it allows pointers-to-object and references-to-object to be formatted. Upon
 222 verifying that a formatter would be a textual match, the Flags are checked, and
 223 if they do not allow the formatter to be used (e.g. pointers are not allowed,
 224 and one is looking at a Foo*), then the formatter is rejected and the search
 225 continues. If the flags also match, then the formatter is returned upstream and
 226 the search is over.
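
A minimal sketch of the flag check, assuming each candidate records how its type name was derived and each formatter carries the three flags discussed above. The field names are illustrative and are not taken from the C++ classes.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    type_name: str
    stripped_pointer: bool = False
    stripped_reference: bool = False
    stripped_typedef: bool = False


@dataclass
class Formatter:
    name: str
    cascades: bool = True
    skip_pointers: bool = False
    skip_references: bool = False


def flags_allow(candidate, formatter):
    """Apply the flag checks after the textual match has succeeded."""
    if candidate.stripped_pointer and formatter.skip_pointers:
        return False
    if candidate.stripped_reference and formatter.skip_references:
        return False
    if candidate.stripped_typedef and not formatter.cascades:
        return False
    return True


def find(candidates, formatters):
    """Return the first formatter accepted by both name and flags."""
    for candidate in candidates:
        formatter = formatters.get(candidate.type_name)
        if formatter is not None and flags_allow(candidate, formatter):
            return formatter
    return None


# Candidates for a variable of type "Foo *".
candidates = [Candidate("Foo *"), Candidate("Foo", stripped_pointer=True)]
strict = find(candidates, {"Foo": Formatter("foo", skip_pointers=True)})
lenient = find(candidates, {"Foo": Formatter("foo")})
```

Here ``strict`` is ``None`` because the only textual match was reached by stripping a pointer, which that formatter does not allow.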
 227
 228 One relevant fact to notice is that this entire mechanism is not dependent on
 229 the kind of formatter to be returned, which makes it easier to devise new types
 230 of formatters as the lowest layers of the system. The demands on individual
 231 formatters are that they define a few typedefs, and export a Flags object, and
 232 then they can be freely matched against types as needed.
 233
 234 This mechanism is replicated across a number of categories. A category is a
 235 named bucket where formatters are grouped on some basis. The most common reason
 236 for a category to exist is a library (e.g. ``libcxx`` formatters vs. ``libstdcpp``
 237 formatters). Categories can be enabled or disabled, and they have a priority
 238 number, called position. The priority sets a strong order among enabled
 239 categories. A category named "default" is always the highest priority one and
 240 it's the category where all formatters that do not ask for a category of their
 241 own end up (e.g. ``type summary add ....`` without a ``-w somecategory`` flag
 242 passed). The algorithm queries each category, in the order of their priorities,
 243 for a formatter for a type, and upon receiving a positive answer from a
 244 category, ends the search. Of course, no search occurs in disabled categories.
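
The category walk can be sketched in a few lines. In this toy model a category is a name, a dictionary of formatters, an enabled bit and a position. Treating a lower position as a higher priority is an assumption made for the example.

```python
class Category:
    def __init__(self, name, formatters, enabled=True, position=0):
        self.name = name
        self.formatters = formatters  # type name -> formatter
        self.enabled = enabled
        self.position = position  # assumption: lower value = higher priority


def lookup(categories, type_name):
    """Ask enabled categories in priority order; stop at the first hit."""
    enabled = [c for c in categories if c.enabled]
    for category in sorted(enabled, key=lambda c: c.position):
        formatter = category.formatters.get(type_name)
        if formatter is not None:
            return category.name, formatter
    return None


categories = [
    Category("libcxx", {"Foo": "B", "Bar": "C"}, position=1),
    Category("default", {"Foo": "A"}, position=0),
    Category("experimental", {"Baz": "D"}, enabled=False, position=2),
]
```

With this data, ``lookup(categories, "Foo")`` is answered by ``default`` even though ``libcxx`` also has an entry, and ``Baz`` is never found because its category is disabled.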
 245
 246 At the individual category level, there is the first dependence on the type of
 247 formatter to be returned. Since both filters and synthetic children proper are
 248 implemented through the same backing store, the matching code needs to ensure
 249 that, were both a synthetic children provider and a filter to match a type,
 250 only the most recently added one is actually used. The details of the algorithm
 251 used are to be found in ``TypeCategoryImpl::Get()``.
 252
 253 It is quite obvious, even to a casual reader, that there are a number of
 254 complexities involved in this algorithm. For starters, the entire search
 255 process has to be repeated for every variable. Moreover, for each category, one
 256 has to repeat the entire process of crawling the types (go to pointee, ...).
 257 This is exactly the algorithm initially implemented by LLDB. Over the course of
 258 the life of the formatters subsystem, two main evolutions have been made to the
 259 matching mechanism:
 260
 261 * A caching mechanism
 262 * A pregeneration of all possible type matches
 263
 264 The cache is a layer that sits between the ``FormatManager`` and the
 265 ``TypeCategoryMap``. Upon being asked to figure out a formatter, the ``FormatManager``
 266 will first query the cache layer, and only if that fails, will the categories
 267 be queried using the full search algorithm. The result of that full search will
 268 then be stored in the cache. Even a negative answer (no formatter) gets stored.
 269 The negative answer is actually the most beneficial to cache as obtaining it
 270 requires traversing all possible formatters in all categories just to get a
 271 no-op back.
 272
 273 Of course, once an answer is cached, getting it will be much quicker than going
 274 to a full category search, as the cached answers are of the form "type foo" -->
 275 "formatter bar". But given how formatters can be edited or removed by the user,
 276 either at the command line or via the API, there needs to be a way to
 277 invalidate the cache.
 278
 279 This happens through the ``FormatManager::Changed()`` method. In general, anything
 280 that changes the formatters causes ``FormatManager::Changed()`` to be called
 281 through the ``IFormatChangeListener`` interface. This call increases the
 282 ``FormatManager``'s revision and clears the cache. The revision number is a
 283 monotonically increasing integer counter that essentially corresponds to the
 284 number of changes made to the formatters throughout the current LLDB session.
 285 This counter is used by ``ValueObjects`` to know when their formatters are out of
 286 date. Since a search is a potentially expensive operation, before caching was
 287 introduced, individual ``ValueObjects`` remembered which revision of the
 288 ``FormatManager`` they used to search for their formatter, and stored it, so that
 289 they would not repeat the search unless a change in the formatters had
 290 occurred. While caching has made this less critical of an optimization, it is
 291 still sensible and thus is kept.
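
The interplay between the cache, the revision counter and the per-value bookkeeping might look like this in miniature. ``ToyFormatManager`` and ``ToyValue`` are hypothetical names. The ``search`` callable stands in for the full category search, and a plain dictionary stands in for the format cache.

```python
class ToyFormatManager:
    """Toy cache in front of an expensive search, with a revision counter."""

    def __init__(self, search):
        self.search = search  # full search: type name -> formatter or None
        self.cache = {}
        self.revision = 0
        self.searches = 0

    def get(self, type_name):
        if type_name in self.cache:  # hit, including a cached "no formatter"
            return self.cache[type_name]
        self.searches += 1
        result = self.search(type_name)
        self.cache[type_name] = result  # negative answers are stored too
        return result

    def changed(self):
        self.revision += 1
        self.cache.clear()  # any change invalidates the whole cache


class ToyValue:
    """Remembers the revision it last searched at, like a ValueObject."""

    def __init__(self, type_name, manager):
        self.type_name, self.manager = type_name, manager
        self.seen_revision = None
        self.formatter = None

    def get_formatter(self):
        if self.seen_revision != self.manager.revision:
            self.formatter = self.manager.get(self.type_name)
            self.seen_revision = self.manager.revision
        return self.formatter


table = {"Foo": "foo summary"}
manager = ToyFormatManager(table.get)
first, second = manager.get("Bar"), manager.get("Bar")
searches_for_bar = manager.searches

table["Bar"] = "bar summary"  # edit the formatters without notifying
stale = manager.get("Bar")
manager.changed()             # now notify: the cache is dropped
fresh = manager.get("Bar")
foo = ToyValue("Foo", manager).get_formatter()
```

The ``stale`` lookup shows why every edit path has to end in a call that invalidates the cache.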
 292
 293 Lastly, as a side note, it is worth highlighting that any change in the
 294 formatters invalidates the entire cache. It would likely not be impossible to
 295 be smarter and figure out a subset of cache entries to be deleted, letting
 296 others persist, instead of having to rebuild the entire cache from scratch.
 297 However, given that formatters are not that frequently changed during a debug
 298 session, and the algorithmic complexity to "get it right" seems larger than the
 299 potential benefit to be had from doing it, the full cache invalidation is the
 300 chosen policy. The algorithm to selectively invalidate entries is probably one
 301 of the major areas for improvements in formatters performance.
 302
 303 The second major optimization, introduced fairly recently, is the pregeneration
 304 of type matches. The original algorithm was based upon the notion of a
 305 ``FormatNavigator`` as a smart object, aware of all the intricacies of the
 306 matching rules. For each category, the ``FormatNavigator`` would generate the
 307 possible matches (e.g. dynamic type, pointee type, ...), and check each one,
 308 one at a time. If that failed for a category, the next one would again generate
 309 the same matches.
 310
 311 This worked well, but was of course inefficient. The
 312 ``FormattersMatchCandidate`` is the solution to this performance issue. In
 313 top-of-tree LLDB, the ``FormatManager`` has the centralized notion of the
 314 matching rules, and the former ``FormatNavigators`` are now
 315 ``FormattersContainers``, whose only job is to guarantee a centralized storage
 316 of formatters, and thread-safe access to such storage.
 317
 318 ``FormatManager::GetPossibleMatches()`` fills a vector of possible matches. The
 319 way it works is by applying each rule, generating the corresponding typename,
 320 and storing the typename, plus the required Flags for that rule to be accepted
 321 as a match candidate (e.g. if the match comes by fetching the pointee type, a
 322 formatter that matches will have to allow pointees as part of its Flags
 323 object). The ``TypeCategoryMap``, when tasked with finding a formatter for a
 324 type, generates all possible matches and passes them down to each category. In
 325 this model, the type system only does its (expensive) job once, and textual or
 326 regex matches are the core of the work.
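
To make the pregeneration idea concrete, the sketch below derives match candidates from a type name once, recording for each one which flags a formatter must tolerate. It is a toy: plain string handling and a ``typedefs`` dictionary replace the real type-system queries, and the flag names are invented for the example.

```python
def possible_matches(type_name, typedefs=None):
    """Return (name, required_flags) candidates for a textual type name."""
    typedefs = typedefs or {}
    matches = []
    seen = set()

    def visit(name, flags):
        if (name, flags) in seen:
            return
        seen.add((name, flags))
        matches.append((name, flags))
        # Each rule yields a new name plus the flag that rule requires.
        if name.endswith(" &"):
            visit(name[:-2], flags | {"stripped_reference"})
        if name.endswith(" *"):
            visit(name[:-2], flags | {"stripped_pointer"})
        if name in typedefs:
            visit(typedefs[name], flags | {"stripped_typedef"})

    visit(type_name, frozenset())
    return matches


candidates = possible_matches("MyInt *", typedefs={"MyInt": "int"})
names = [name for name, _ in candidates]
```

The resulting list can be handed unchanged to every category, so the derivation runs once per lookup rather than once per category.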
 327
 328 FormatManager and DataVisualization
 329 -----------------------------------
 330
 331 There are two main entry points in the data formatters: the ``FormatManager`` and
 332 the ``DataVisualization``.
 333
 334 The ``FormatManager`` is the internal entry point. In this context,
 335 internal refers to data formatters code itself, compared to other parts of
 336 LLDB. For other components of the debugger, the ``DataVisualization`` provides
 337 a more stable entry point. On the other hand, the ``FormatManager`` is an
 338 aggregator of all moving parts, and as such is less stable in the face of
 339 refactoring.
 340
 341 People involved in the data formatters code itself, however, will most likely
 342 have to confront the ``FormatManager`` for significant architecture changes.
 343
 344 The ``FormatManager`` wraps a ``TypeCategoryMap`` (the list of all existing
 345 categories, enabled and not), the ``FormatCache``, and several utility objects.
 346 Plus, it is the repository of named summaries, since these don't logically
 347 belong anywhere else.
 348
 349 It is also responsible for creating all builtin formatters upon the launch of
 350 LLDB. It does so through a bunch of methods ``Load***Formatters()``, invoked as
 351 part of its constructor. The original design of data formatters anticipated
 352 that individual libraries would load their formatters as part of their debug
 353 information. This work however has largely been left unattended in practice,
 354 and as such core system libraries (mostly those for macOS/iOS development as of
 355 today) load their formatters in a hardcoded fashion.
 356
 357 For performance reasons, the ``FormatManager`` is constructed upon being first
 358 required. This happens through the ``DataVisualization`` layer. Upon first
 359 being asked for anything formatter-related, ``DataVisualization`` calls its own
 360 local static function ``GetFormatManager()``, which in turn constructs and
 361 returns a local static ``FormatManager``.
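
The construct-on-first-use pattern can be mimicked in Python with a cached zero-argument function. This is a sketch of the idea only. The class names are reused for readability, the ``builtin`` table is made up, and ``functools.lru_cache`` stands in for the C++ local static.

```python
import functools


class ToyFormatManager:
    instances = 0  # counts constructions, for demonstration only

    def __init__(self):
        ToyFormatManager.instances += 1
        # Stands in for the Load***Formatters() calls in the constructor.
        self.builtin = {"char *": "c-string summary"}


@functools.lru_cache(maxsize=None)
def get_format_manager():
    """Build the manager on first use and return the same one afterwards."""
    return ToyFormatManager()


class DataVisualization:
    """Static-style facade; every query goes through get_format_manager()."""

    @staticmethod
    def get_summary(type_name):
        return get_format_manager().builtin.get(type_name)
```

Nothing is constructed until the first ``DataVisualization.get_summary()`` call, and later calls reuse the same manager for the rest of the session.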
 362
 363 Unlike most things in LLDB, the lifetime of the ``FormatManager`` is the same
 364 as the entire session, rather than a specific ``Debugger`` or ``Target``
 365 instance. This is an area to be improved, but as of now it has not caused
 366 enough grief to warrant action. If this work were to be undertaken, one could
 367 conceivably devise a per-architecture-triple model, upon the assumption that an
 368 OS and CPU combination are a good enough key to decide which formatters apply
 369 (e.g. Linux i386 is probably different from macOS x86_64, but two macOS x86_64
 370 targets will probably have the same formatters; of course versioning of the
 371 underlying OS is also to be considered, but experience with OSX has shown that
 372 formatters can take care of that internally in most cases of interest).
 373
 374 The public entry point is the ``DataVisualization`` layer.
 375 ``DataVisualization`` is a static class on which questions can be asked in a
 376 relatively refactoring-safe manner.
 377
 378 The main question asked of it is to obtain formatters for ``ValueObjects`` (or
 379 typenames). One can also query ``DataVisualization`` for named summaries or
 380 individual categories, but of course those queries delve deeper in the internal
 381 object model.
 382
 383 As said, the ``FormatManager`` holds a notion of revision number, which changes
 384 every time formatters are edited (added, deleted, categories enabled or
 385 disabled, ...). Through ``DataVisualization::ForceUpdate()`` one can cause the
 386 same effects of a formatters edit to happen without it actually having
 387 happened.
 388
 389 The main reason for this feature is that formatters can be dynamically created
 390 in Python, and one can then enter the ``ScriptInterpreter`` and edit the
 391 formatter function or class. If formatters were not updated, one could find
 392 them to be out of sync with the new definitions of these objects. To avoid the
 393 issue, whenever the user exits the scripting mode, formatters force an update
 394 to make sure new potential definitions are reloaded on demand.
 395
 396 The SWIG Layer
 397 --------------
 398
 399 In order to implement formatters written in Python, LLDB requires that
 400 ``ScriptInterpreter`` implementations provide a set of functions that one can call
 401 to ask formatting questions of scripts.
 402
 403 For instance, in order to obtain a scripting summary, LLDB calls:
 404
 405 ::
 406
 407   virtual bool
 408   GetScriptedSummary(const char *function_name, lldb::ValueObjectSP valobj,
 409                      lldb::ScriptInterpreterObjectSP &callee_wrapper_sp,
 410                      std::string &retval)
 411
 412
 413 For Python, this function is implemented by first checking if the
 414 ``callee_wrapper_sp`` is valid. If so, LLDB knows that it does not need to
 415 search for a function with the passed name, and can directly call the wrapped
 416 Python function object. Either way, the call is routed to a global callback
 417 ``g_swig_typescript_callback``.
 418
 419 This callback pointer points to ``LLDBSwigPythonCallTypeScript``. The details
 420 of the implementation require familiarity with the Python C API, plus a few
 421 utility objects defined by LLDB to ease the burden of dealing with the
 422 scripting world. However, as a sketch of what happens, the code tries to find a
 423 Python function object with the given name (i.e. if you say ``type summary add
 424 -F module.function`` LLDB will scan for the ``module`` module, and then for a
 425 function named ``function`` inside the module's namespace). If the function
 426 object is found, it is wrapped in a ``PyCallable``, which is an LLDB utility class
 427 that wraps the callable and allows for easier calling. The callable gets
 428 invoked, and the return value, if any, is cast into a string. Originally, if a
 429 non-string object was returned, LLDB would refuse to use it. This disallowed
 430 such a simple construct as:
 431
 432 ::
 433
 434   def getSummary(value,*args):
 435     return 1
 436
 437 Similar considerations apply to other formatter (and non-formatter related)
 438 scripting callbacks.
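
The lookup-and-call sequence described in this section can be approximated in pure Python. This is not the SWIG glue itself, which is written against the Python C API. ``resolve_callable`` and ``call_summary`` are hypothetical helpers, and the ``toy_formatters`` module is created on the fly so that the example is self-contained.

```python
import importlib
import sys
import types


def resolve_callable(dotted_name):
    """Find 'module.function' by importing the module, then the attribute."""
    module_name, _, function_name = dotted_name.rpartition(".")
    if not module_name:
        return None
    module = importlib.import_module(module_name)
    function = getattr(module, function_name, None)
    return function if callable(function) else None


def call_summary(dotted_name, valobj, internal_dict=None):
    """Call the summary function and coerce a non-string result to str."""
    function = resolve_callable(dotted_name)
    if function is None:
        return None
    result = function(valobj, internal_dict)
    return None if result is None else str(result)


# A throwaway module holding a summary function that returns an int.
toy = types.ModuleType("toy_formatters")
toy.get_summary = lambda value, *args: 1
sys.modules["toy_formatters"] = toy

summary = call_summary("toy_formatters.get_summary", object())
missing = call_summary("toy_formatters.no_such_function", object())
```

Converting the result with ``str()`` is what lets a function that returns ``1`` produce a usable summary.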