dhat/docs/dh-manual.xml

   1 <?xml version="1.0"?> <!-- -*- sgml -*- -->
   2 <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
   3           "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"
   4 [ <!ENTITY % vg-entities SYSTEM "../../docs/xml/vg-entities.xml"> %vg-entities; ]>
   5
   6
   7 <chapter id="dh-manual"
   8          xreflabel="DHAT: a dynamic heap analysis tool">
   9   <title>DHAT: a dynamic heap analysis tool</title>
  10
  11 <para>To use this tool, you must specify
  12 <option>--tool=dhat</option> on the Valgrind command line.</para>
  13
  14
  15
  16 <sect1 id="dh-manual.overview" xreflabel="Overview">
  17 <title>Overview</title>
  18
  19 <para>DHAT is primarily a tool for examining how programs use their heap
  20 allocations.</para>
  21
  22 <para>It tracks the allocated blocks, and inspects every memory access
  23 to find which block, if any, it is to. It presents, on a program point
  24 basis, information about these blocks such as sizes, lifetimes, numbers of
  25 reads and writes, and read and write patterns.</para>
  26
  27 <para>Using this information it is possible to identify program points with
  28 the following characteristics:</para>
  29
  30 <itemizedlist>
  31
  32   <listitem><para>potential process-lifetime leaks: blocks allocated
  33    by the point just accumulate, and are freed only at the end of the
  34    run.</para></listitem>
  35
  36  <listitem><para>excessive turnover: points which chew through a lot
  37   of heap, even if it is not held onto for very long</para></listitem>
  38
  39  <listitem><para>excessively transient: points which allocate very
  40  short lived blocks</para></listitem>
  41
  42  <listitem><para>useless or underused allocations: blocks which are
  43   allocated but not completely filled in, or are filled in but not
  44   subsequently read.</para></listitem>
  45
  46  <listitem><para>blocks with inefficient layout -- areas never
  47   accessed, or with hot fields scattered throughout the
  48   block.</para></listitem>
  49 </itemizedlist>
  50
  51 <para>As with the Massif heap profiler, DHAT measures program progress
  52 by counting instructions, and so presents all age/time related figures
  53 as instruction counts. This sounds a little odd at first, but it
  54 makes runs repeatable in a way which is not possible if CPU time is
  55 used.</para>
  56
  57 <para>DHAT also has support for copy profiling and ad hoc profiling. These are
  58 described below.</para>
  59
  60 </sect1>
  61
  62
  63
  64 <sect1 id="dh-manual.profile" xreflabel="Using DHAT">
  65 <title>Using DHAT</title>
  66
  67 <para>First off, as for normal Valgrind use, you probably want to compile with
  68 debugging info (the <option>-g</option> option). But by contrast with normal
  69 Valgrind use, you probably do want to turn optimisation on, since you should
  70 profile your program as it will be normally run.</para>
  71
  72 <para>Second, you need to run your program under DHAT to gather the profiling
  73 information. You might need to reduce the <option>--num-callers</option> value
  74 to get reasonably-sized output files, especially if you are profiling a large
  75 program; some trial and error might be needed to find a good value.</para>
  76
  77 <para>Finally, you need to use DHAT's viewer (in a web browser) to get a
  78 detailed presentation of that information.</para>
  79
  80
  81 <sect2 id="dh-manual.running-DHAT" xreflabel="Running DHAT">
  82 <title>Running DHAT</title>
  83
  84 <para>To run DHAT on a program <filename>prog</filename>, run:</para>
  85 <screen><![CDATA[
  86 valgrind --tool=dhat prog
  87 ]]></screen>
  88
  89 <para>The program will execute (slowly). Upon completion, summary statistics
  90 that look like this will be printed:</para>
  91
  92 <programlisting><![CDATA[
  93 ==11514== Total:     823,849,731 bytes in 3,929,133 blocks
  94 ==11514== At t-gmax: 133,485,082 bytes in 436,521 blocks
  95 ==11514== At t-end:  258,002 bytes in 2,129 blocks
  96 ==11514== Reads:     2,807,182,810 bytes
  97 ==11514== Writes:    1,149,617,086 bytes
  98 ]]></programlisting>
  99
 100 <para>The first line shows how many heap blocks and bytes were allocated over
 101 the entire execution.</para>
 102
 103 <para>The second line shows how many heap blocks and bytes were alive at
 104 <computeroutput>t-gmax</computeroutput>, i.e. the time when the heap size
 105 reached its global maximum (as measured in bytes).</para>
 106
 107 <para>The third line shows how many heap blocks and bytes were alive at
 108 <computeroutput>t-end</computeroutput>, i.e. the end of execution. In other
 109 words, how many blocks and bytes were not explicitly freed. </para>
 110
 111 <para>The fourth and fifth lines show how many bytes within heap blocks were
 112 read and written during the entire execution. </para>
 113
 114 <para>These lines are moderately interesting at best. More useful information
 115 can be seen with DHAT's viewer.</para>
 116
 117 </sect2>
 118
 119
 120 <sect2 id="dh-manual.outputfile" xreflabel="Output File">
 121 <title>Output File</title>
 122
 123 <para>As well as printing summary information, DHAT also writes more detailed
 124 profiling information to a file. By default this file is named
 125 <filename>dhat.out.&lt;pid&gt;</filename> (where
 126 <filename>&lt;pid&gt;</filename> is the program's process ID), but its name can
 127 be changed with the <option>--dhat-out-file</option> option. This file is JSON,
 128 and intended to be viewed by DHAT's viewer, which is described in the next
 129 section.</para>
 130
 131 <para>The default <computeroutput>.&lt;pid&gt;</computeroutput> suffix on the
 132 output file name serves two purposes. Firstly, it means you don't have to
 133 rename old log files that you don't want to overwrite. Secondly, and more
 134 importantly, it allows correct profiling with the
 135 <option>--trace-children=yes</option> option of programs that spawn child
 136 processes.</para>
 137
 138 <para>The output file can be big, many megabytes for large applications
 139 built with full debugging information.</para>
 140
 141 </sect2>
 142
 143 </sect1>
 144
 145
 146
 147 <sect1 id="dh-manual.viewer" xreflabel="DHAT's viewer">
 148 <title>DHAT's Viewer</title>
 149
 150 <para>DHAT's viewer can be run in a web browser by loading the file
 151 <computeroutput>dh_view.html</computeroutput>. Use the "Load" button to choose
 152 a DHAT output file to view.</para>
 153
 154 <para>If loading takes a long time, it might be worth re-running DHAT with a
 155 smaller <option>--num-callers</option> value to reduce the stack depths,
 156 because this can significantly reduce the size of DHAT's output files.</para>
 157
 158
 159 <sect2 id="dh-output-header"><title>The Output Header</title>
 160
 161 <para>The first part of the output shows the mode, program command and process
 162 ID. For example:</para>
 163
 164 <programlisting><![CDATA[
 165 Invocation {
 166   Mode:    heap
 167   Command: /home/njn/moz/rust0/build/x86_64-unknown-linux-gnu/stage2/bin/rustc --crate-name tuple_stress src/main.rs
 168   PID:     18816
 169 }
 170 ]]></programlisting>
 171
 172 <para>The second part of the output shows the
 173 <computeroutput>t-gmax</computeroutput> and
 174 <computeroutput>t-end</computeroutput> values again. For example:</para>
 175
 176 <programlisting><![CDATA[
 177 Times {
 178   t-gmax: 8,138,210,673 instrs (86.92% of program duration)
 179   t-end:  9,362,544,994 instrs
 180 }
 181 ]]></programlisting>
 182
 183 </sect2>
 184
 185
 186 <sect2 id="dh-ap-tree"><title>The PP Tree</title>
 187
 188 <para>The third part of the output is the largest and most interesting part,
 189 showing the program point (PP) tree.</para>
 190
 191
 192 <sect3 id="dh-structure"><title>Structure</title>
 193
 194 <para>The following image shows a screenshot of part of a PP
 195 tree. The font is very small because this screenshot is intended to
 196 demonstrate the high-level structure of the tree rather than the
 197 details within the text. (It is also slightly out-of-date, and doesn't quite
 198 match the current output produced by DHAT's viewer.)</para>
 199
 200 <graphic fileref="images/dh-tree.png" scalefit="1"/>
 201
 202 <para>Like any tree, it has a root node, leaf nodes, and non-leaf nodes. The
 203 structure of the tree is shown by the lines connecting nodes. Child nodes are
 204 beneath their parent and indented one level.</para>
 205
 206 <para>The sub-trees beneath a non-leaf node can be collapsed or expanded by
 207 clicking on the node. It is useful to collapse sub-trees that you aren't
 208 interested in.</para>
 209
 210 <para>Colours are meaningful, and are intended to ease tree navigation, but the
 211 information they represent is also present within the text. (This means that
 212 colour-blind users are not denied any information.)</para>
 213
 214 <para>Each leaf node is coloured green. Each non-leaf node is coloured blue
 215 and has a down arrow (<computeroutput>▼</computeroutput>) next to it when
 216 its sub-tree is expanded. Each non-leaf node is coloured yellow and has a
 217 right arrow (<computeroutput>▶</computeroutput>) next to it when its sub-tree
 218 is collapsed.</para>
 219
 220 <para>The shade of green, blue or yellow used for a node indicates its
 221 significance. Darker shades represent greater significance (in terms of bytes
 222 or blocks).</para>
 223
 224 <para>Note that the entire output is text, even the arrows and lines connecting
 225 nodes. This means you can copy and paste any part of the output easily into an
 226 email, bug report, etc.</para>
 227
 228 </sect3>
 229
 230
 231 <sect3 id="dh-root-node"><title>The Root Node</title>
 232
 233 <para>The root node looks like this:</para>
 234
 235 <programlisting><![CDATA[
 236 PP 1/1 (25 children) {
 237   Total:     1,355,253,987 bytes (100%, 67,454.81/Minstr) in 5,943,417 blocks (100%, 295.82/Minstr), avg size 228.03 bytes, avg lifetime 3,134,692,250.67 instrs (15.6% of program duration)
 238   At t-gmax: 423,930,307 bytes (100%) in 1,575,682 blocks (100%), avg size 269.05 bytes
 239   At t-end:  258,002 bytes (100%) in 2,129 blocks (100%), avg size 121.18 bytes
 240   Reads:     5,478,606,988 bytes (100%, 272,685.7/Minstr), 4.04/byte
 241   Writes:    2,040,294,800 bytes (100%, 101,551.22/Minstr), 1.51/byte
 242   Allocated at {
 243     #0: [root]
 244   }
 245 }
 246 ]]></programlisting>
 247
 248 <para>The root node covers the entire execution. The information is a superset
 249 of the information shown when DHAT ran, adding details such as allocation
 250 rates, average block sizes, block lifetimes, and read and write ratios. The
 251 next example will explain these in more detail.</para>
 252
 253 </sect3>
 254
 255
 256 <sect3 id="dh-interior-nodes"><title>Interior Nodes</title>
 257
 258 <para>PP nodes further down the tree show information about a subset of
 259 allocations. For example:</para>
 260
 261 <programlisting><![CDATA[
 262 PP 1.1/25 (2 children) {
 263   Total:     54,533,440 bytes (4.02%, 2,714.28/Minstr) in 458,839 blocks (7.72%, 22.84/Minstr), avg size 118.85 bytes, avg lifetime 1,127,259,403.64 instrs (5.61% of program duration)
 264   At t-gmax: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
 265   At t-end:  0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
 266   Reads:     15,993,012 bytes (0.29%, 796.02/Minstr), 0.29/byte
 267   Writes:    20,974,752 bytes (1.03%, 1,043.97/Minstr), 0.38/byte
 268   Allocated at {
 269     #1: 0x95CACC9: alloc (alloc.rs:72)
 270     #2: 0x95CACC9: alloc (alloc.rs:148)
 271     #3: 0x95CACC9: reserve_internal<syntax::tokenstream::TokenStream,alloc::alloc::Global> (raw_vec.rs:669)
 272     #4: 0x95CACC9: reserve<syntax::tokenstream::TokenStream,alloc::alloc::Global> (raw_vec.rs:492)
 273     #5: 0x95CACC9: reserve<syntax::tokenstream::TokenStream> (vec.rs:460)
 274     #6: 0x95CACC9: push<syntax::tokenstream::TokenStream> (vec.rs:989)
 275     #7: 0x95CACC9: parse_token_trees_until_close_delim (tokentrees.rs:27)
 276     #8: 0x95CACC9: syntax::parse::lexer::tokentrees::<impl syntax::parse::lexer::StringReader<'a>>::parse_token_tree (tokentrees.rs:81)
 277   }
 278 }
 279 ]]></programlisting>
 280
 281 <para>The first line indicates the node's position in the tree. The
 282 <computeroutput>1.1</computeroutput> is a unique identifier for the node and
 283 also says that it is the first child of node <computeroutput>1</computeroutput>
 284 (which is the root). The <computeroutput>/25</computeroutput> says that it is
 285 one of 25 children, i.e. it has 24 siblings. The <computeroutput>(2
 286 children)</computeroutput> says that this node has two children of its
 287 own.</para>
 288
 289 <para>Allocations are aggregated by their allocation stack trace. The
 290 <computeroutput>Allocated at</computeroutput> section shows the allocation
 291 stack trace that is shared by all the blocks covered by this node.</para>
 292
 293 <para>The <computeroutput>Total</computeroutput> line shows that this node
 294 accounts for 4.02% of all bytes allocated during execution, and 7.72% of all
 295 blocks. These percentages are useful for comparing the significance of
 296 different nodes within a single profile; a PP that accounts for 10% of bytes
 297 allocated is likely to be more interesting than one that accounts for
 298 2%.</para>
 299
 300 <para>The <computeroutput>Total</computeroutput> line also shows allocation
 301 rates, measured in bytes and blocks per million instructions. These rates are
 302 useful for comparing the significance of nodes across profiles made with
 303 different workloads.</para>
 304
 305 <para>Finally, the <computeroutput>Total</computeroutput> line shows the
 306 average size and lifetimes of these blocks.</para>
 307
 308 <para>The <computeroutput>At t-gmax</computeroutput> line shows that no
 309 blocks from this PP were alive when the global heap peak occurred. In other
 310 words, these blocks do not contribute at all to the global heap peak.</para>
 311
 312 <para>The <computeroutput>At t-end</computeroutput> line shows that no blocks
 313 from this PP were alive at shutdown. In other words, all those blocks were
 314 explicitly freed before termination.</para>
 315
 316 <para>The <computeroutput>Reads</computeroutput> line shows how many bytes
 317 were read within this PP's blocks, the fraction this represents of all heap
 318 reads, and the read rate. Finally, it shows the read ratio, which is the
 319 number of reads per byte. In this case the number is 0.29, which is quite low
 320 -- even if no byte was read twice, only 29% of the allocated bytes were read,
 321 which means that at least 71% of the bytes were never read! This suggests
 322 that the blocks are being underutilized and might be worth
 323 optimizing.</para>
 324
 325 <para>The <computeroutput>Writes</computeroutput> line is similar to the
 326 <computeroutput>Reads</computeroutput> line. In this case, at most 38% of the
 327 bytes are ever written, and at least 62% of the bytes were never written.
 328 </para>
 329
 330 <para>The <computeroutput>Reads</computeroutput> and
 331 <computeroutput>Writes</computeroutput> measurements suggest that the blocks
 332 are being under-utilised and might be worth optimizing. Having said that, this
 333 kind of under-utilisation is common in data structures that grow, such as
 334 vectors and hash tables, and isn't always fixable. </para>
 335
 336 </sect3>
 337
 338
 339 <sect3 id="dh-leaf-nodes"><title>Leaf Nodes</title>
 340
 341 <para>This is a leaf node:</para>
 342
 343 <programlisting><![CDATA[
 344 PP 1.1.1.1/2 {
 345   Total:     31,460,928 bytes (2.32%, 1,565.9/Minstr) in 262,171 blocks (4.41%, 13.05/Minstr), avg size 120 bytes, avg lifetime 986,406,885.05 instrs (4.91% of program duration)
 346   Max:       16,779,136 bytes in 65,543 blocks, avg size 256 bytes
 347   At t-gmax: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
 348   At t-end:  0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
 349   Reads:     5,964,704 bytes (0.11%, 296.88/Minstr), 0.19/byte
 350   Writes:    10,487,200 bytes (0.51%, 521.98/Minstr), 0.33/byte
 351   Allocated at {
 352     ^1: 0x95CACC9: alloc (alloc.rs:72)
 353     ^2: 0x95CACC9: alloc (alloc.rs:148)
 354     ^3: 0x95CACC9: reserve_internal<syntax::tokenstream::TokenStream,alloc::alloc::Global> (raw_vec.rs:669)
 355     ^4: 0x95CACC9: reserve<syntax::tokenstream::TokenStream,alloc::alloc::Global> (raw_vec.rs:492)
 356     ^5: 0x95CACC9: reserve<syntax::tokenstream::TokenStream> (vec.rs:460)
 357     ^6: 0x95CACC9: push<syntax::tokenstream::TokenStream> (vec.rs:989)
 358     ^7: 0x95CACC9: parse_token_trees_until_close_delim (tokentrees.rs:27)
 359     ^8: 0x95CACC9: syntax::parse::lexer::tokentrees::<impl syntax::parse::lexer::StringReader<'a>>::parse_token_tree (tokentrees.rs:81)
 360     ^9: 0x95CAC39: parse_token_trees_until_close_delim (tokentrees.rs:26)
 361     ^10: 0x95CAC39: syntax::parse::lexer::tokentrees::<impl syntax::parse::lexer::StringReader<'a>>::parse_token_tree (tokentrees.rs:81)
 362     #11: 0x95CAC39: parse_token_trees_until_close_delim (tokentrees.rs:26)
 363     #12: 0x95CAC39: syntax::parse::lexer::tokentrees::<impl syntax::parse::lexer::StringReader<'a>>::parse_token_tree (tokentrees.rs:81)
 364   }
 365 }
 366 ]]></programlisting>
 367
 368 <para>The <computeroutput>1.1.1.1/2</computeroutput> indicates that this node
 369 is a great-grandchild of the root; is the first grandchild of the node in the
 370 previous example; and has no children.</para>
 371
 372 <para>Leaf nodes contain an additional <computeroutput>Max</computeroutput>
 373 line, indicating the peak memory use for the blocks covered by this PP. (This
 374 peak may have occurred at a time other than
 375 <computeroutput>t-gmax</computeroutput>.) In this case, 31,460,928 bytes were
 376 allocated from this PP, but the maximum size alive at once was 16,779,136
 377 bytes.</para>
 378
 379 <para>Stack frames that begin with a <computeroutput>^</computeroutput> rather
 380 than a <computeroutput>#</computeroutput> are copied from ancestor nodes.
 381 (In this example, the first 8 frames are identical to those from the node in
 382 the previous example.) These frames could be found by tracing back through
 383 ancestor nodes, but that can be annoying, which is why they are duplicated.
 384 This also means that each node makes complete sense on its own.</para>
 385
 386 </sect3>
 387
 388
 389 <sect3 id="dh-access-counts"><title>Access Counts</title>
 390
 391 <para>If all blocks covered by a PP node have the same size, an additional
 392 <computeroutput>Accesses</computeroutput> field will be present. It indicates
 393 how the reads and writes within these blocks were distributed. For
 394 example:</para>
 395
 396 <programlisting><![CDATA[
 397 Total:     8,388,672 bytes (0.62%, 417.53/Minstr) in 262,146 blocks (4.41%, 13.05/Minstr), avg size 32 bytes, avg lifetime 16,726,078,401.51 instrs (83.25% of program duration)
 398 At t-gmax: 8,388,672 bytes (1.98%) in 262,146 blocks (16.64%), avg size 32 bytes
 399 At t-end:  0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
 400 Reads:     9,109,682 bytes (0.17%, 453.41/Minstr), 1.09/byte
 401 Writes:    7,340,088 bytes (0.36%, 365.34/Minstr), 0.88/byte
 402 Accesses: {
 403   [  0]  65547 7 8 4 65529 〃 〃 〃 16 〃 〃 〃 12 〃 〃 〃 〃 〃 〃 〃 〃 〃 〃 〃 65542 〃 〃 〃 - - - -
 404 }
 405 ]]></programlisting>
 406
 407 <para>Every block covered by this PP was 32 bytes. Within all of those blocks,
 408 byte 0 was accessed (read or written) 65,547 times, byte 1 was accessed 7
 409 times, byte 2 was accessed 8 times, and so on.</para>
 410
 411 <para>The ditto symbol (<computeroutput>〃</computeroutput>) means "same access
 412 count as the previous byte".</para>
 413
 414 <para>A dash (<computeroutput>-</computeroutput>) means "zero". (It is used
 415 instead of <computeroutput>0</computeroutput> because it makes unaccessed
 416 regions more easily identifiable.)</para>
 417
 418 <para>The infinity symbol (<computeroutput>∞</computeroutput>, not present in
 419 this example) means "exceeded the maximum tracked count".</para>
 420
 421 <para>Block layout can often be inferred from counts. For example, these blocks
 422 probably have four separate byte-sized fields, followed by a four-byte field,
 423 and so on.</para>
 424
 425 <para>Access counts are measured and displayed only for blocks of up to
 426 1024 bytes. This is done to limit the performance overhead and also to keep
 427 the size of the generated output reasonable. However, it is possible to override
 428 this limit using client requests. The use-case for this is to first run DHAT
 429 normally, and then identify any large blocks that you would like to further
 430 investigate with access count histograms. The client request is declared in
 431 <filename>dhat/dhat.h</filename> and is called <computeroutput>DHAT_HISTOGRAM_MEMORY</computeroutput>.
 432 The macro should be placed immediately after the call to the allocator,
 433 and use the pointer returned by the allocator.</para>
 434
 435 <programlisting><![CDATA[
 436 // LargeStruct bigger than 1024 bytes
 437 struct LargeStruct* ls = malloc(sizeof(struct LargeStruct));
 438 DHAT_HISTOGRAM_MEMORY(ls);
 439 ]]></programlisting>
 440
 441 <para>The memory that can be profiled in this way with client requests
 442 has a further upper limit of 25 kbytes. Be aware that the access counts
 443 are all set to zero when the client request is made. This means that they
 444 will not include any reads or writes performed during initialisation. One example where
 445 this happens is C++ <computeroutput>new</computeroutput> with a user-defined constructor.</para>
 446
 447 <para>Access counts can be useful for identifying data alignment holes or other
 448 layout inefficiencies.</para>
 449
 450 </sect3>
 451
 452
 453 <sect3 id="aggregate-nodes"><title>Aggregate Nodes</title>
 454
 455 <para>The PP tree is very large and many nodes represent tiny numbers of blocks
 456 and bytes. Therefore, DHAT's viewer aggregates insignificant nodes like
 457 this:</para>
 458
 459 <programlisting><![CDATA[
 460 PP 1.14.2/2 {
 461   Total:     5,175 blocks (0.09%, 0.26/Minstr)
 462   Allocated at {
 463     [5 insignificant]
 464   }
 465 }
 466 ]]></programlisting>
 467
 468 <para>Much of the detail is stripped away, leaving only basic measurements,
 469 along with an indication of how many nodes were aggregated together (5 in this
 470 case).</para>
 471
 472 </sect3>
 473
 474 </sect2>
 475
 476
 477 <sect2 id="dh-output-footer"><title>The Output Footer</title>
 478
 479 <para>Below the PP tree is a line like this:</para>
 480
 481 <programlisting><![CDATA[
 482 PP significance threshold: total >= 59,434.17 blocks (1%)
 483 ]]></programlisting>
 484
 485 <para>It shows the function used to determine if a PP node is significant. All
 486 nodes that don't satisfy this function are aggregated. It is occasionally
 487 useful if you don't understand why a PP node has been aggregated. The exact
 488 threshold depends on the sort metric (see below).</para>
 489
 490 <para>Finally, the bottom of the page shows a legend that explains some of the
 491 terms, abbreviations and symbols used in the output.</para>
 492
 493 </sect2>
 494
 495
 496 <sect2 id="dh-sort-metrics"><title>Sort Metrics</title>
 497
 498 <para>The order in which sub-trees are sorted can be changed via the "Sort
 499 metric" drop-down menu at the top of DHAT's viewer. Different sort metrics can
 500 be useful for finding different things. Some sort metrics also incorporate some
 501 filtering, so that only nodes meeting particular criteria are shown.</para>
 502
 503 <!-- start of xi:include in the manpage -->
 504 <variablelist>
 505
 506   <varlistentry>
 507     <term>Total (bytes)</term>
 508     <listitem><para>The total number of bytes allocated during the execution.
 509     Highly useful for evaluating heap churn, though not quite as useful as
 510     "Total (blocks)".
 511     </para></listitem>
 512   </varlistentry>
 513
 514   <varlistentry>
 515     <term>Total (blocks)</term>
 516     <listitem><para>The total number of blocks allocated during the execution.
 517     Highly useful for evaluating heap churn; reducing the number of calls to
 518     the allocator can significantly speed up a program. This is the default
 519     sort metric.
 520     </para></listitem>
 521   </varlistentry>
 522
 523   <varlistentry>
 524     <term>Total (blocks), tiny</term>
 525     <listitem><para>Like "Total (blocks)", but shows only very small blocks.
 526     Moderately useful, because such blocks are often easy to avoid allocating.
 527     </para></listitem>
 528   </varlistentry>
 529
 530   <varlistentry>
 531     <term>Total (blocks), short-lived</term>
 532     <listitem><para>Like "Total (blocks)", but shows only very short-lived
 533     blocks. Moderately useful, because such blocks are often easy to avoid
 534     allocating.
 535     </para></listitem>
 536   </varlistentry>
 537
 538   <varlistentry>
 539     <term>Total (bytes), zero reads or zero writes</term>
 540     <listitem><para>Like "Total (bytes)", but shows only blocks that are
 541     never read or never written to (or both). Highly useful, because such
 542     blocks indicate poor use of memory and are often easy to avoid allocating.
 543     For example, sometimes a block is allocated and written to but then only
 544     read if a condition C is true; in that case, it may be possible to delay
 545     creating the block until condition C is true. Alternatively, sometimes
 546     blocks are created and never used; such blocks are trivial to remove.
 547     </para></listitem>
 548   </varlistentry>
 549
 550   <varlistentry>
 551     <term>Total (blocks), zero reads or zero writes</term>
 552     <listitem><para>Like "Total (bytes), zero reads or zero writes" but for
 553     blocks. Highly useful.
 554     </para></listitem>
 555   </varlistentry>
 556
 557   <varlistentry>
 558     <term>Total (bytes), low-access</term>
 559     <listitem><para>Like "Total (bytes)", but shows only blocks that have low
 560     numbers of reads or low numbers of writes (or both). Moderately useful,
 561     because such blocks indicate poor use of memory.
 562     </para></listitem>
 563   </varlistentry>
 564
 565   <varlistentry>
 566     <term>Total (blocks), low-access</term>
 567     <listitem><para>Like "Total (bytes), low-access", but for blocks.
 568     </para></listitem>
 569   </varlistentry>
 570
 571   <varlistentry>
 572     <term>At t-gmax (bytes)</term>
 573     <listitem><para>This shows the breakdown of memory at the point of peak
 574     heap memory usage. Highly useful for reducing peak memory usage.
 575     </para></listitem>
 576   </varlistentry>
 577
 578   <varlistentry>
 579     <term>At t-end (bytes)</term>
 580     <listitem><para>This shows the breakdown of memory at program termination.
 581     Highly useful for identifying process-lifetime leaks.
 582     </para></listitem>
 583   </varlistentry>
 584
 585   <varlistentry>
 586     <term>Reads (bytes)</term>
 587     <listitem><para>The number of bytes read within heap blocks. Occasionally
 588     useful.
 589     </para></listitem>
 590   </varlistentry>
 591
 592   <varlistentry>
 593     <term>Reads (bytes), high-access</term>
 594     <listitem><para>Like "Reads (bytes)", but only shows blocks with high read
 595     ratios. Occasionally useful for identifying hot areas of memory.
 596     </para></listitem>
 597   </varlistentry>
 598
 599   <varlistentry>
 600     <term>Writes (bytes)</term>
 601     <listitem><para>Like "Reads (bytes)", but for writes. Occasionally useful.
 602     </para></listitem>
 603   </varlistentry>
 604
 605   <varlistentry>
 606     <term>Writes (bytes), high-access</term>
 607     <listitem><para>Like "Reads (bytes), high-access", but for writes.
 608     Occasionally useful.
 609     </para></listitem>
 610   </varlistentry>
 611
 612 </variablelist>
 613
 614 <para>The values within a node that represent the chosen sort metric are shown
 615 in bold, so they stand out.</para>
 616
 617 <para>Here is part of a PP node found with "Total (blocks), tiny", showing
 618 blocks with an average size of only 8.67 bytes:</para>
 619
 620 <programlisting><![CDATA[
 621 Total:     3,407,848 bytes (0.25%, 169.62/Minstr) in 393,214 blocks (6.62%, 19.57/Minstr), avg size 8.67 bytes, avg lifetime 1,167,795,629.1 instrs (5.81% of program duration)
 622 ]]></programlisting>
 623
 624 <para>Here is part of a PP node found with "Total (blocks), short-lived",
 625 showing blocks with an average lifetime of only 181.75 instructions:</para>
 626
 627 <programlisting><![CDATA[
 628 Total:     23,068,584 bytes (1.7%, 1,148.19/Minstr) in 262,143 blocks (4.41%, 13.05/Minstr), avg size 88 bytes, avg lifetime 181.75 instrs (0% of program duration)
 629 ]]></programlisting>
 630
 631 <para>Here is an example of a PP identified with "Total (blocks), zero reads
 632 or zero writes", showing blocks that are allocated but never touched:</para>
 633
 634 <programlisting><![CDATA[
 635 Total:     7,339,920 bytes (0.54%, 365.33/Minstr) in 262,140 blocks (4.41%, 13.05/Minstr), avg size 28 bytes, avg lifetime 1,141,103,997.69 instrs (5.68% of program duration)
 636 Max:       3,669,960 bytes in 131,070 blocks, avg size 28 bytes
 637 At t-gmax: 3,336,400 bytes (0.79%) in 119,157 blocks (7.56%), avg size 28 bytes
 638 At t-end:  0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
 639 Reads:     0 bytes (0%, 0/Minstr), 0/byte
 640 Writes:    0 bytes (0%, 0/Minstr), 0/byte
 641 ]]></programlisting>
 642
 643 <para>All the blocks identified by these PPs are good candidates for
 644 optimization.</para>
 645
 646 </sect2>
 647
 648 </sect1>
 649
 650
 651 <sect1 id="dh-manual.realloc" xreflabel="Treatment of realloc">
 652 <title>Treatment of realloc</title>
 653
 654 <para><computeroutput>realloc</computeroutput> is a tricky function and there
 655 are several different ways that DHAT could handle it.</para>
 656
 657 <para>Imagine a <computeroutput>malloc(100)</computeroutput> call followed by
 658 a <computeroutput>realloc(200)</computeroutput> call. This combination is
 659 considered to add two to the total block count, and 300 bytes to the total
 660 bytes count. (An alternative would be to only add one to the total block
 661 count, and 200 bytes to the total bytes count, as if a single
 662 <computeroutput>malloc(200)</computeroutput> call had occurred. While this
 663 would be defensible from a semantic point of view, it is silly from an
 664 operational point of view, because making two calls to allocator functions is
 665 more expensive than one call, and DHAT is a profiler that aims to help with
 666 runtime costs.)</para>
 667
 668 <para>Furthermore, the implicit copying of the 100 bytes is added to the reads
 669 and writes counts. Without this, the read and write counts would be
 670 under-measured and misleading.</para>
 671
 672 <para>However, DHAT only increases the current heap size by 100 bytes for this
 673 combination, and does not change the current block count. (As opposed to
 674 increasing the current heap size by 200 bytes and then decreasing it by 100
 675 bytes.) As a result, it can only increase the global heap peak (if indeed
 676 this results in a new peak) by 100 bytes.</para>
 677
 678 <para>Finally, the program point assigned to the block allocated by the
 679 <computeroutput>malloc(100)</computeroutput> call is retained once the block
 680 is reallocated, which means that all 300 bytes are attributed to that
 681 program point, and no separate program point is created for the
 682 <computeroutput>realloc(200)</computeroutput> call. This may be surprising,
 683 but it has one large benefit.</para>
 684
 685 <para>Imagine some code that starts with an empty buffer, and then gradually
 686 adds data to that buffer from numerous different points in the code,
 687 reallocating the buffer each time it gets full. (E.g. code generation in a
 688 compiler might work this way.) With the described approach, the first heap
 689 block and all subsequent heap blocks are attributed to the same program point.
 690 While this is something of a lie -- the first program point isn't actually
 691 responsible for the other allocations -- it is arguably better than having the
 692 program points spread around in a distribution that depends unpredictably on
 693 when the reallocations happened to be triggered.</para>
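
<para>The growing-buffer pattern described above might look like the following
sketch. The <computeroutput>Buf</computeroutput> type and the
<computeroutput>buf_init</computeroutput>,
<computeroutput>buf_append</computeroutput>,
<computeroutput>emit_header</computeroutput> and
<computeroutput>emit_body</computeroutput> functions are hypothetical names
chosen for illustration; the initial capacity of 16 bytes and the doubling
policy are likewise assumptions.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *data;
    size_t len;  /* bytes in use */
    size_t cap;  /* bytes allocated */
} Buf;

/* The only malloc call. With the approach described in the text, the
   initial block and every reallocated successor are attributed to the
   program point of this call. */
int buf_init(Buf *b) {
    b->len = 0;
    b->cap = 16;
    b->data = malloc(b->cap);
    return b->data != NULL;
}

/* Append n bytes, doubling the capacity until they fit.
   Returns 1 on success and 0 on failure. */
int buf_append(Buf *b, const char *src, size_t n) {
    assert(b->data != NULL && b->len <= b->cap);
    if (n > b->cap - b->len) {
        size_t new_cap = b->cap;
        while (n > new_cap - b->len) {
            if (new_cap > SIZE_MAX / 2) return 0;  /* doubling would overflow */
            new_cap *= 2;
        }
        char *p = realloc(b->data, new_cap);
        if (p == NULL) return 0;
        b->data = p;
        b->cap = new_cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 1;
}

/* Two unrelated call sites, either of which may trigger a reallocation. */
int emit_header(Buf *b) {
    const char *s = "header;";
    return buf_append(b, s, strlen(s));
}

int emit_body(Buf *b) {
    const char *s = "body-of-the-output;";
    return buf_append(b, s, strlen(s));
}
```
]]></programlisting>

<para>Which of <computeroutput>emit_header</computeroutput> and
<computeroutput>emit_body</computeroutput> happens to trigger a given
reallocation depends on the sizes and order of the earlier appends, which is
why attributing each reallocated block to its own caller would be hard to
interpret.</para>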
 694
 695 </sect1>
 696
 697
 698 <sect1 id="dh-manual.copy-profiling" xreflabel="Copy profiling">
 699 <title>Copy profiling</title>
 700
 701 <para>If DHAT is invoked with <option>--mode=copy</option>, instead of
 702 profiling heap operations (allocations and deallocations), it profiles copy
 703 operations, such as <computeroutput>memcpy</computeroutput>,
 704 <computeroutput>memmove</computeroutput>,
 705 <computeroutput>strcpy</computeroutput>, and
 706 <computeroutput>bcopy</computeroutput>. This is sometimes useful.</para>
 707
 708 <para>Here is an example PP node from this mode:</para>
 709
 710 <programlisting><![CDATA[
 711 PP 1.1.2/5 (4 children) {
 712   Total:     1,210,925 bytes (10.03%, 4,358.66/Minstr) in 112,717 blocks (35.2%, 405.72/Minstr), avg size 10.74 bytes
 713   Copied at {
 714     ^1: 0x4842524: memmove (vg_replace_strmem.c:1289)
 715     #2: 0x1F0A0D: copy_nonoverlapping<u8> (intrinsics.rs:1858)
 716     #3: 0x1F0A0D: copy_from_slice<u8> (mod.rs:2524)
 717     #4: 0x1F0A0D: spec_extend<u8> (vec.rs:2227)
 718     #5: 0x1F0A0D: extend_from_slice<u8> (vec.rs:1619)
 719     #6: 0x1F0A0D: push_str (string.rs:821)
 720     #7: 0x1F0A0D: write_str (string.rs:2418)
 721     #8: 0x1F0A0D: <&mut W as core::fmt::Write>::write_str (mod.rs:195)
 722   }
 723 }
 724 ]]></programlisting>
 725
 726 <para>It is very similar to the PP nodes for heap profiling, but with less
 727 information, because copy profiling doesn't involve any tracking of memory
 728 regions with lifetimes.</para>
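
<para>As a rough illustration, the following C functions each call one of the
copy operations named above, so each call would be expected to show up under
its own stack trace in a copy profile. The function names are hypothetical.
Note also that a compiler may replace a small fixed-size copy with inline code,
in which case there is no library call left to observe.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <string.h>

/* Copy a record of n bytes; the two regions must not overlap. */
void copy_record(char *dst, const char *src, size_t n) {
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n);
}

/* Drop the first k bytes of a buffer by shifting the rest left.
   The regions overlap, so memmove is required. */
void shift_left(char *buf, size_t len, size_t k) {
    assert(k <= len);
    memmove(buf, buf + k, len - k);
}

/* Copy a NUL-terminated name; the caller guarantees that dst is
   large enough to hold it. */
void copy_name(char *dst, const char *name) {
    strcpy(dst, name);
}
```
]]></programlisting>

<para>A program calling these functions could then be run with
<option>--tool=dhat</option> and <option>--mode=copy</option> on the Valgrind
command line.</para>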
 729
 730 </sect1>
 731
 732
 733 <sect1 id="dh-manual.ad-hoc-profiling" xreflabel="Ad hoc profiling">
 734 <title>Ad hoc profiling</title>
 735
 736 <para>If DHAT is invoked with <option>--mode=ad-hoc</option>, instead of
 737 profiling heap operations (allocations and deallocations), it profiles calls to
 738 the <computeroutput>DHAT_AD_HOC_EVENT</computeroutput> client request, which is
 739 declared in <filename>dhat/dhat.h</filename>.</para>
 740
 741 <para>Here is an example PP node from this mode:</para>
 742
 743 <programlisting><![CDATA[
 744 PP 1.1.1.1/2 {
 745   Total:     30 units (17.65%, 115.97/Minstr) in 1 events (14.29%, 3.87/Minstr), avg size 30 units
 746   Occurred at {
 747     ^1: 0x109407: g (ad-hoc.c:4)
 748     ^2: 0x109425: f (ad-hoc.c:8)
 749     #3: 0x109497: main (ad-hoc.c:14)
 750   }
 751 }
 752 ]]></programlisting>
 753
 754 <para>This kind of profiling is useful when you know a code path is hot but you
 755 want to know more about it.</para>
 756
 757 <para>For example, you might want to know which callsites of a hot function
 758 account for most of the calls. You could put a
 759 <computeroutput>DHAT_AD_HOC_EVENT(1);</computeroutput> call at the start of
 760 that function.</para>
 761
 762 <para>Alternatively, you might want to know the typical length of a vector in a
 763 hot location. You could put a
 764 <computeroutput>DHAT_AD_HOC_EVENT(len);</computeroutput> call at the
 765 appropriate location, where <computeroutput>len</computeroutput> is the length
 766 of the vector.</para>
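
<para>Both uses can be sketched in C as follows. The
<computeroutput>checksum</computeroutput> and
<computeroutput>sum_vector</computeroutput> functions are hypothetical. The
sketch assumes that the header is installed as
<filename>valgrind/dhat.h</filename> and that the compiler supports
<computeroutput>__has_include</computeroutput>; if the header is not found, a
no-op fallback macro is defined so that the code still builds.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Assumption: dhat.h is installed as <valgrind/dhat.h>. If it cannot be
   found, fall back to a no-op so the example still compiles. */
#if defined(__has_include)
#  if __has_include(<valgrind/dhat.h>)
#    include <valgrind/dhat.h>
#  endif
#endif
#ifndef DHAT_AD_HOC_EVENT
#  define DHAT_AD_HOC_EVENT(weight) ((void)(weight))
#endif

/* Hot function: one event of weight 1 per call, so the profile shows
   which call stacks account for most of the calls. */
unsigned checksum(const unsigned char *p, size_t len) {
    DHAT_AD_HOC_EVENT(1);
    unsigned sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum = sum * 31u + p[i];
    }
    return sum;
}

/* Hot location: the weight is the vector length, so the average size
   reported for this program point is the typical length. */
long sum_vector(const int *v, size_t len) {
    assert(v != NULL || len == 0);
    DHAT_AD_HOC_EVENT(len);
    long total = 0;
    for (size_t i = 0; i < len; i++) {
        total += v[i];
    }
    return total;
}
```
]]></programlisting>

<para>Because the two events use different units (calls versus elements), it is
probably clearest to enable only one kind of event in any given run.</para>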
 767
 768 </sect1>
 769
 770
 771 <sect1 id="dh-manual.options" xreflabel="DHAT Command-line Options">
 772 <title>DHAT Command-line Options</title>
 773
 774 <para>DHAT-specific command-line options are:</para>
 775
 776 <!-- start of xi:include in the manpage -->
 777 <variablelist id="dh.opts.list">
 778
 779   <varlistentry id="opt.dhat-out-file" xreflabel="--dhat-out-file">
 780     <term>
 781       <option><![CDATA[--dhat-out-file=<file> ]]></option>
 782     </term>
 783     <listitem>
 784       <para>Write the profile data to
 785             <computeroutput>file</computeroutput> rather than to the default
 786             output file,
 787             <filename>dhat.out.&lt;pid&gt;</filename>. The
 788             <option>%p</option> and <option>%q</option> format specifiers
 789             can be used to embed the process ID and/or the contents of an
 790             environment variable in the name, as is the case for the core
 791             option <option><link linkend="opt.log-file">--log-file</link></option>.
 792       </para>
 793     </listitem>
 794   </varlistentry>
 795
 796   <varlistentry id="opt.mode" xreflabel="--mode">
 797     <term>
 798       <option><![CDATA[--mode=<heap|copy|ad-hoc> [default: heap] ]]></option>
 799     </term>
 800     <listitem>
 801       <para>The profiling mode: heap profiling, copy profiling, or ad hoc
 802             profiling.
 803       </para>
 804     </listitem>
 805   </varlistentry>
 806
 807 </variablelist>
 808
 809 <para>Note that stack traces have 12 frames by default. This may be more than
 810 necessary, in which case the <option>--num-callers</option> flag can be used to
 811 reduce the number, which may make DHAT run slightly faster.
 812 </para>
 813
 814 <!-- end of xi:include in the manpage -->
 815
 816 </sect1>
 817
 818 </chapter>