1 <?xml version="1.0"?> <!-- -*- sgml -*- -->
2 <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
3 "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"
4 [ <!ENTITY % vg-entities SYSTEM "../../docs/xml/vg-entities.xml"> %vg-entities; ]>
7 <chapter id="dh-manual"
8 xreflabel="DHAT: a dynamic heap analysis tool">
9 <title>DHAT: a dynamic heap analysis tool</title>
11 <para>To use this tool, you must specify
12 <option>--tool=dhat</option> on the Valgrind command line.</para>
16 <sect1 id="dh-manual.overview" xreflabel="Overview">
17 <title>Overview</title>
19 <para>DHAT is primarily a tool for examining how programs use their heap
22 <para>It tracks the allocated blocks, and inspects every memory access
23 to find which block, if any, it is to. It presents, on a program point
24 basis, information about these blocks such as sizes, lifetimes, numbers of
25 reads and writes, and read and write patterns.</para>
27 <para>Using this information it is possible to identify program points with
28 the following characteristics:</para>
32 <listitem><para>potential process-lifetime leaks: blocks allocated
33 by the point just accumulate, and are freed only at the end of the
34 run.</para></listitem>
36 <listitem><para>excessive turnover: points which chew through a lot
37 of heap, even if it is not held onto for very long</para></listitem>
39 <listitem><para>excessively transient: points which allocate very
40 short lived blocks</para></listitem>
42 <listitem><para>useless or underused allocations: blocks which are
43 allocated but not completely filled in, or are filled in but not
44 subsequently read.</para></listitem>
46 <listitem><para>blocks with inefficient layout -- areas never
47 accessed, or with hot fields scattered throughout the
48 block.</para></listitem>
51 <para>As with the Massif heap profiler, DHAT measures program progress
52 by counting instructions, and so presents all age/time related figures
53 as instruction counts. This sounds a little odd at first, but it
54 makes runs repeatable in a way which is not possible if CPU time is
57 <para>DHAT also has support for copy profiling and ad hoc profiling. These are
58 described below.</para>
64 <sect1 id="dh-manual.profile" xreflabel="Using DHAT">
65 <title>Using DHAT</title>
67 <para>First off, as for normal Valgrind use, you probably want to compile with
68 debugging info (the <option>-g</option> option). But by contrast with normal
69 Valgrind use, you probably do want to turn optimisation on, since you should
70 profile your program as it will be normally run.</para>
72 <para>Second, you need to run your program under DHAT to gather the profiling
73 information. You might need to reduce the <option>--num-callers</option> value
74 to get reasonably-sized output files, especially if you are profiling a large
75 program; some trial and error might be needed to find a good value.</para>
77 <para>Finally, you need to use DHAT's viewer (in a web browser) to get a
78 detailed presentation of that information.</para>
81 <sect2 id="dh-manual.running-DHAT" xreflabel="Running DHAT">
82 <title>Running DHAT</title>
84 <para>To run DHAT on a program <filename>prog</filename>, run:</para>
86 valgrind --tool=dhat prog
89 <para>The program will execute (slowly). Upon completion, summary statistics
90 that look like this will be printed:</para>
92 <programlisting><![CDATA[
93 ==11514== Total: 823,849,731 bytes in 3,929,133 blocks
94 ==11514== At t-gmax: 133,485,082 bytes in 436,521 blocks
95 ==11514== At t-end: 258,002 bytes in 2,129 blocks
96 ==11514== Reads: 2,807,182,810 bytes
97 ==11514== Writes: 1,149,617,086 bytes
100 <para>The first line shows how many heap blocks and bytes were allocated over
101 the entire execution.</para>
103 <para>The second line shows how many heap blocks and bytes were alive at
104 <computeroutput>t-gmax</computeroutput>, i.e. the time when the heap size
105 reached its global maximum (as measured in bytes).</para>
107 <para>The third line shows how many heap blocks and bytes were alive at
108 <computeroutput>t-end</computeroutput>, i.e. the end of execution. In other
109 words, how many blocks and bytes were not explicitly freed. </para>
111 <para>The fourth and fifth lines show how many bytes within heap blocks were
112 read and written during the entire execution. </para>
114 <para>These lines are moderately interesting at best. More useful information
115 can be seen with DHAT's viewer.</para>
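<para>As a rough illustration of how these summary lines relate to program
behaviour, here is a minimal sketch of a workload with a known allocation
pattern. The function name <computeroutput>run_workload</computeroutput> and
the constants are hypothetical and are not part of DHAT or Valgrind.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { NUM_BLOCKS = 10, BLOCK_SIZE = 100 };

/* Allocates NUM_BLOCKS heap blocks of BLOCK_SIZE bytes each, writes to
   every byte, then frees all but the last block. Returns the number of
   bytes that are deliberately left unfreed. */
size_t run_workload(void)
{
    char *blocks[NUM_BLOCKS];

    for (int i = 0; i < NUM_BLOCKS; i++) {
        blocks[i] = malloc(BLOCK_SIZE);
        /* Sketch only: no real error handling for a failed allocation. */
        assert(blocks[i] != NULL);
        memset(blocks[i], i, BLOCK_SIZE);   /* writes within a heap block */
    }

    for (int i = 0; i < NUM_BLOCKS - 1; i++) {
        free(blocks[i]);
    }

    /* The last block is intentionally never freed, so it should still be
       alive at t-end. */
    return BLOCK_SIZE;
}
```
]]></programlisting>

<para>If <computeroutput>run_workload</computeroutput> is called once from
<computeroutput>main</computeroutput> and the program is run under DHAT, the
first summary line should account for these 10 blocks and 1,000 bytes, and the
<computeroutput>t-end</computeroutput> line for the one unfreed 100-byte block.
The exact figures may differ, because the C library and startup code may
allocate heap blocks of their own, and an optimising compiler may remove
allocations that it can prove are unused.</para>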
120 <sect2 id="dh-manual.outputfile" xreflabel="Output File">
121 <title>Output File</title>
123 <para>As well as printing summary information, DHAT also writes more detailed
124 profiling information to a file. By default this file is named
125 <filename>dhat.out.<pid></filename> (where
126 <filename><pid></filename> is the program's process ID), but its name can
127 be changed with the <option>--dhat-out-file</option> option. This file is JSON,
128 and intended to be viewed by DHAT's viewer, which is described in the next
131 <para>The default <computeroutput>.<pid></computeroutput> suffix on the
132 output file name serves two purposes. Firstly, it means you don't have to
133 rename old log files that you don't want to overwrite. Secondly, and more
134 importantly, it allows correct profiling with the
135 <option>--trace-children=yes</option> option of programs that spawn child
138 <para>The output file can be big, many megabytes for large applications
139 built with full debugging information.</para>
147 <sect1 id="dh-manual.viewer" xreflabel="DHAT's viewer">
148 <title>DHAT's Viewer</title>
150 <para>DHAT's viewer can be run in a web browser by loading the file
151 <computeroutput>dh_view.html</computeroutput>. Use the "Load" button to choose
152 a DHAT output file to view.</para>
154 <para>If loading takes a long time, it might be worth re-running DHAT with a
155 smaller <option>--num-callers</option> value to reduce the stack depths,
156 because this can significantly reduce the size of DHAT's output files.</para>
159 <sect2 id="dh-output-header"><title>The Output Header</title>
161 <para>The first part of the output shows the mode, program command and process
162 ID. For example:</para>
164 <programlisting><![CDATA[
167 Command: /home/njn/moz/rust0/build/x86_64-unknown-linux-gnu/stage2/bin/rustc --crate-name tuple_stress src/main.rs
172 <para>The second part of the output shows the
173 <computeroutput>t-gmax</computeroutput> and
174 <computeroutput>t-end</computeroutput> values again. For example:</para>
176 <programlisting><![CDATA[
178 t-gmax: 8,138,210,673 instrs (86.92% of program duration)
179 t-end: 9,362,544,994 instrs
186 <sect2 id="dh-ap-tree"><title>The PP Tree</title>
188 <para>The third part of the output is the largest and most interesting part,
189 showing the program point (PP) tree.</para>
192 <sect3 id="dh-structure"><title>Structure</title>
194 <para>The following image shows a screenshot of part of a PP
195 tree. The font is very small because this screenshot is intended to
196 demonstrate the high-level structure of the tree rather than the
197 details within the text. (It is also slightly out-of-date, and doesn't quite
198 match the current output produced by DHAT's viewer.)</para>
200 <graphic fileref="images/dh-tree.png" scalefit="1"/>
202 <para>Like any tree, it has a root node, leaf nodes, and non-leaf nodes. The
203 structure of the tree is shown by the lines connecting nodes. Child nodes are
204 beneath their parent and indented one level.</para>
206 <para>The sub-trees beneath a non-leaf node can be collapsed or expanded by
207 clicking on the node. It is useful to collapse sub-trees that you aren't
208 interested in.</para>
210 <para>Colours are meaningful, and are intended to ease tree navigation, but the
211 information they represent is also present within the text. (This means that
212 colour-blind users are not denied any information.)</para>
<para>Each leaf node is coloured green. Each non-leaf node is coloured blue
and has a down arrow (<computeroutput>▼</computeroutput>) next to it when
its sub-tree is expanded. Each non-leaf node is coloured yellow and has a
right arrow (<computeroutput>▶</computeroutput>) next to it when its sub-tree
is collapsed.</para>
<para>The shade of green, blue or yellow used for a node indicates its
221 significance. Darker shades represent greater significance (in terms of bytes
224 <para>Note that the entire output is text, even the arrows and lines connecting
225 nodes. This means you can copy and paste any part of the output easily into an
226 email, bug report, etc.</para>
231 <sect3 id="dh-root-node"><title>The Root Node</title>
233 <para>The root node looks like this:</para>
235 <programlisting><![CDATA[
236 PP 1/1 (25 children) {
237 Total: 1,355,253,987 bytes (100%, 67,454.81/Minstr) in 5,943,417 blocks (100%, 295.82/Minstr), avg size 228.03 bytes, avg lifetime 3,134,692,250.67 instrs (15.6% of program duration)
238 At t-gmax: 423,930,307 bytes (100%) in 1,575,682 blocks (100%), avg size 269.05 bytes
239 At t-end: 258,002 bytes (100%) in 2,129 blocks (100%), avg size 121.18 bytes
240 Reads: 5,478,606,988 bytes (100%, 272,685.7/Minstr), 4.04/byte
241 Writes: 2,040,294,800 bytes (100%, 101,551.22/Minstr), 1.51/byte
248 <para>The root node covers the entire execution. The information is a superset
249 of the information shown when DHAT ran, adding details such as allocation
250 rates, average block sizes, block lifetimes, and read and write ratios. The
251 next example will explain these in more detail.</para>
256 <sect3 id="dh-interior-nodes"><title>Interior Nodes</title>
258 <para>PP nodes further down the tree show information about a subset of
259 allocations. For example:</para>
261 <programlisting><![CDATA[
262 PP 1.1/25 (2 children) {
263 Total: 54,533,440 bytes (4.02%, 2,714.28/Minstr) in 458,839 blocks (7.72%, 22.84/Minstr), avg size 118.85 bytes, avg lifetime 1,127,259,403.64 instrs (5.61% of program duration)
264 At t-gmax: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
265 At t-end: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
266 Reads: 15,993,012 bytes (0.29%, 796.02/Minstr), 0.29/byte
267 Writes: 20,974,752 bytes (1.03%, 1,043.97/Minstr), 0.38/byte
269 #1: 0x95CACC9: alloc (alloc.rs:72)
270 #2: 0x95CACC9: alloc (alloc.rs:148)
271 #3: 0x95CACC9: reserve_internal<syntax::tokenstream::TokenStream,alloc::alloc::Global> (raw_vec.rs:669)
272 #4: 0x95CACC9: reserve<syntax::tokenstream::TokenStream,alloc::alloc::Global> (raw_vec.rs:492)
273 #5: 0x95CACC9: reserve<syntax::tokenstream::TokenStream> (vec.rs:460)
274 #6: 0x95CACC9: push<syntax::tokenstream::TokenStream> (vec.rs:989)
275 #7: 0x95CACC9: parse_token_trees_until_close_delim (tokentrees.rs:27)
276 #8: 0x95CACC9: syntax::parse::lexer::tokentrees::<impl syntax::parse::lexer::StringReader<'a>>::parse_token_tree (tokentrees.rs:81)
281 <para>The first line indicates the node's position in the tree. The
282 <computeroutput>1.1</computeroutput> is a unique identifier for the node and
also says that it is the first child of node <computeroutput>1</computeroutput>
(which is the root). The <computeroutput>/25</computeroutput> says that it is
one of 25 children, i.e. it has 24 siblings. The <computeroutput>(2
children)</computeroutput> says that this node has two children of its
own.</para>
289 <para>Allocations are aggregated by their allocation stack trace. The
290 <computeroutput>Allocated at</computeroutput> section shows the allocation
291 stack trace that is shared by all the blocks covered by this node.</para>
293 <para>The <computeroutput>Total</computeroutput> line shows that this node
294 accounts for 4.02% of all bytes allocated during execution, and 7.72% of all
295 blocks. These percentages are useful for comparing the significance of
296 different nodes within a single profile; a PP that accounts for 10% of bytes
297 allocated is likely to be more interesting than one that accounts for
300 <para>The <computeroutput>Total</computeroutput> line also shows allocation
301 rates, measured in bytes and blocks per million instructions. These rates are
302 useful for comparing the significance of nodes across profiles made with
303 different workloads.</para>
305 <para>Finally, the <computeroutput>Total</computeroutput> line shows the
306 average size and lifetimes of these blocks.</para>
<para>The <computeroutput>At t-gmax</computeroutput> line shows that no
blocks from this PP were alive when the global heap peak occurred. In other
words, these blocks do not contribute at all to the global heap peak.</para>

<para>The <computeroutput>At t-end</computeroutput> line shows that no blocks
from this PP were alive at shutdown. In other words, all those blocks were
explicitly freed before termination.</para>
<para>The <computeroutput>Reads</computeroutput> line shows how many bytes
were read within this PP's blocks, the fraction this represents of all heap
reads, and the read rate. Finally, it shows the read ratio, which is the
number of bytes read per byte allocated. In this case the ratio is 0.29, which
is quite low: even if no byte was read twice, only 29% of the allocated bytes
were read, which means that at least 71% of the bytes were never read. This
suggests that the blocks are being underutilized and might be worth
optimizing.</para>

<para>The <computeroutput>Writes</computeroutput> line is similar to the
<computeroutput>Reads</computeroutput> line. In this case, at most 38% of the
bytes were ever written, and at least 62% of the bytes were never written.
</para>
330 <para>The <computeroutput>Reads</computeroutput> and
331 <computeroutput>Writes</computeroutput> measurements suggest that the blocks
332 are being under-utilised and might be worth optimizing. Having said that, this
333 kind of under-utilisation is common in data structures that grow, such as
334 vectors and hash tables, and isn't always fixable. </para>
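<para>The arithmetic behind the read and write ratios can be sketched as
follows. This is only an illustration of the reasoning above, not code taken
from DHAT; the function names <computeroutput>access_ratio</computeroutput> and
<computeroutput>min_untouched_fraction</computeroutput> are hypothetical.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Bytes accessed per byte allocated, as shown at the end of a Reads or
   Writes line. */
double access_ratio(uint64_t accessed_bytes, uint64_t allocated_bytes)
{
    assert(allocated_bytes > 0);
    return (double)accessed_bytes / (double)allocated_bytes;
}

/* Lower bound on the fraction of allocated bytes that were never
   accessed. If every access touched a distinct byte, a ratio r below 1
   means a fraction 1 - r was never touched; repeated accesses to the
   same byte can only make the untouched fraction larger. */
double min_untouched_fraction(double ratio)
{
    return ratio < 1.0 ? 1.0 - ratio : 0.0;
}
```
]]></programlisting>

<para>With the figures from the node above,
<computeroutput>access_ratio(15993012, 54533440)</computeroutput> is roughly
0.29 for reads and
<computeroutput>access_ratio(20974752, 54533440)</computeroutput> is roughly
0.38 for writes, giving lower bounds of about 71% never read and 62% never
written.</para>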
339 <sect3 id="dh-leaf-nodes"><title>Leaf Nodes</title>
341 <para>This is a leaf node:</para>
343 <programlisting><![CDATA[
345 Total: 31,460,928 bytes (2.32%, 1,565.9/Minstr) in 262,171 blocks (4.41%, 13.05/Minstr), avg size 120 bytes, avg lifetime 986,406,885.05 instrs (4.91% of program duration)
346 Max: 16,779,136 bytes in 65,543 blocks, avg size 256 bytes
347 At t-gmax: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
348 At t-end: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
349 Reads: 5,964,704 bytes (0.11%, 296.88/Minstr), 0.19/byte
350 Writes: 10,487,200 bytes (0.51%, 521.98/Minstr), 0.33/byte
352 ^1: 0x95CACC9: alloc (alloc.rs:72)
353 ^2: 0x95CACC9: alloc (alloc.rs:148)
354 ^3: 0x95CACC9: reserve_internal<syntax::tokenstream::TokenStream,alloc::alloc::Global> (raw_vec.rs:669)
355 ^4: 0x95CACC9: reserve<syntax::tokenstream::TokenStream,alloc::alloc::Global> (raw_vec.rs:492)
356 ^5: 0x95CACC9: reserve<syntax::tokenstream::TokenStream> (vec.rs:460)
357 ^6: 0x95CACC9: push<syntax::tokenstream::TokenStream> (vec.rs:989)
358 ^7: 0x95CACC9: parse_token_trees_until_close_delim (tokentrees.rs:27)
359 ^8: 0x95CACC9: syntax::parse::lexer::tokentrees::<impl syntax::parse::lexer::StringReader<'a>>::parse_token_tree (tokentrees.rs:81)
360 ^9: 0x95CAC39: parse_token_trees_until_close_delim (tokentrees.rs:26)
361 ^10: 0x95CAC39: syntax::parse::lexer::tokentrees::<impl syntax::parse::lexer::StringReader<'a>>::parse_token_tree (tokentrees.rs:81)
362 #11: 0x95CAC39: parse_token_trees_until_close_delim (tokentrees.rs:26)
363 #12: 0x95CAC39: syntax::parse::lexer::tokentrees::<impl syntax::parse::lexer::StringReader<'a>>::parse_token_tree (tokentrees.rs:81)
368 <para>The <computeroutput>1.1.1.1/2</computeroutput> indicates that this node
369 is a great-grandchild of the root; is the first grandchild of the node in the
370 previous example; and has no children.</para>
372 <para>Leaf nodes contain an additional <computeroutput>Max</computeroutput>
373 line, indicating the peak memory use for the blocks covered by this PP. (This
374 peak may have occurred at a time other than
<computeroutput>t-gmax</computeroutput>.) In this case, 31,460,928 bytes were
376 allocated from this PP, but the maximum size alive at once was 16,779,136
379 <para>Stack frames that begin with a <computeroutput>^</computeroutput> rather
380 than a <computeroutput>#</computeroutput> are copied from ancestor nodes.
381 (In this example, the first 8 frames are identical to those from the node in
382 the previous example.) These frames could be found by tracing back through
383 ancestor nodes, but that can be annoying, which is why they are duplicated.
384 This also means that each node makes complete sense on its own.</para>
389 <sect3 id="dh-access-counts"><title>Access Counts</title>
391 <para>If all blocks covered by a PP node have the same size, an additional
392 <computeroutput>Accesses</computeroutput> field will be present. It indicates
393 how the reads and writes within these blocks were distributed. For
396 <programlisting><![CDATA[
397 Total: 8,388,672 bytes (0.62%, 417.53/Minstr) in 262,146 blocks (4.41%, 13.05/Minstr), avg size 32 bytes, avg lifetime 16,726,078,401.51 instrs (83.25% of program duration)
398 At t-gmax: 8,388,672 bytes (1.98%) in 262,146 blocks (16.64%), avg size 32 bytes
399 At t-end: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
400 Reads: 9,109,682 bytes (0.17%, 453.41/Minstr), 1.09/byte
401 Writes: 7,340,088 bytes (0.36%, 365.34/Minstr), 0.88/byte
403 [ 0] 65547 7 8 4 65529 〃 〃 〃 16 〃 〃 〃 12 〃 〃 〃 〃 〃 〃 〃 〃 〃 〃 〃 65542 〃 〃 〃 - - - -
407 <para>Every block covered by this PP was 32 bytes. Within all of those blocks,
408 byte 0 was accessed (read or written) 65,547 times, byte 1 was accessed 7
409 times, byte 2 was accessed 8 times, and so on.</para>
411 <para>The ditto symbol (<computeroutput>〃</computeroutput>) means "same access
412 count as the previous byte".</para>
414 <para>A dash (<computeroutput>-</computeroutput>) means "zero". (It is used
415 instead of <computeroutput>0</computeroutput> because it makes unaccessed
416 regions more easily identifiable.)</para>
418 <para>The infinity symbol (<computeroutput>∞</computeroutput>, not present in
419 this example) means "exceeded the maximum tracked count".</para>
421 <para>Block layout can often be inferred from counts. For example, these blocks
422 probably have four separate byte-sized fields, followed by a four-byte field,
<para>Access counts are measured and displayed only for blocks of at most
1024 bytes. This limits the performance overhead and keeps the size of the
generated output reasonable. However, the limit can be overridden using a
client request. The intended use is to first run DHAT normally, and then
identify any large blocks that you would like to investigate further with
access count histograms. The client request is declared in
<filename>dhat/dhat.h</filename> and is called
<computeroutput>DHAT_HISTOGRAM_MEMORY</computeroutput>. The macro should be
placed immediately after the call to the allocator, and should be passed the
pointer returned by the allocator.</para>
435 <programlisting><![CDATA[
// LargeStruct is bigger than 1024 bytes
struct LargeStruct* ls = malloc(sizeof(struct LargeStruct));
DHAT_HISTOGRAM_MEMORY(ls);
<para>The memory that can be profiled in this way with client requests
has a further upper limit of 25 kbytes. Be aware that the access counts
are all set to zero when the request is made, so they will not include any
reads or writes performed before that point, such as during initialisation.
One example where this happens is the use of C++
<computeroutput>new</computeroutput> with a user-defined constructor.</para>
447 <para>Access counts can be useful for identifying data alignment holes or other
448 layout inefficiencies.</para>
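<para>To make the idea of an alignment hole concrete, here is a small sketch
that measures padding in a struct. Both struct definitions and both helper
functions are hypothetical examples, not taken from any profiled program, and
the exact sizes depend on the platform ABI.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record with a small field before a large one. */
struct Record {
    uint8_t  kind;    /* 1 byte */
    uint64_t id;      /* usually aligned, leaving a hole after kind */
    uint8_t  flags;   /* 1 byte, usually followed by tail padding */
};

/* The same fields with the largest first, which typically needs less
   padding. */
struct ReorderedRecord {
    uint64_t id;
    uint8_t  kind;
    uint8_t  flags;
};

/* Bytes of struct Record that belong to no field. */
size_t record_padding(void)
{
    size_t field_bytes = sizeof(uint8_t) + sizeof(uint64_t) + sizeof(uint8_t);
    assert(sizeof(struct Record) >= field_bytes);
    return sizeof(struct Record) - field_bytes;
}

/* Size of the hole between the end of kind and the start of id. */
size_t hole_after_kind(void)
{
    return offsetof(struct Record, id)
         - (offsetof(struct Record, kind) + sizeof(uint8_t));
}
```
]]></programlisting>

<para>In an <computeroutput>Accesses</computeroutput> display for heap blocks
holding such a struct, padding bytes would typically appear as runs of
<computeroutput>-</computeroutput>, since ordinary field accesses never touch
them. Whole-struct copies may still touch padding, so this is a guide rather
than a guarantee.</para>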
453 <sect3 id="aggregate-nodes"><title>Aggregate Nodes</title>
455 <para>The PP tree is very large and many nodes represent tiny numbers of blocks
456 and bytes. Therefore, DHAT's viewer aggregates insignificant nodes like
459 <programlisting><![CDATA[
461 Total: 5,175 blocks (0.09%, 0.26/Minstr)
468 <para>Much of the detail is stripped away, leaving only basic measurements,
469 along with an indication of how many nodes were aggregated together (5 in this
477 <sect2 id="dh-output-footer"><title>The Output Footer</title>
479 <para>Below the PP tree is a line like this:</para>
481 <programlisting><![CDATA[
482 PP significance threshold: total >= 59,434.17 blocks (1%)
485 <para>It shows the function used to determine if a PP node is significant. All
486 nodes that don't satisfy this function are aggregated. It is occasionally
487 useful if you don't understand why a PP node has been aggregated. The exact
488 threshold depends on the sort metric (see below).</para>
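<para>The threshold check can be modelled with a few lines of arithmetic. This
is a sketch that assumes the threshold is simply a fixed fraction of the root
node's total for the chosen metric, which is consistent with the figures shown
in this chapter (1% of 5,943,417 blocks is 59,434.17 blocks). The function
names are hypothetical and the viewer's actual implementation may differ.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

/* Threshold for a metric, given the root node's total and the
   significance fraction (0.01 for the 1% shown in the footer). */
double significance_threshold(double root_total, double fraction)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    return root_total * fraction;
}

/* A node that meets the threshold is shown on its own; in this model,
   any other node would be folded into an aggregate node. */
bool is_significant(double node_value, double root_total, double fraction)
{
    return node_value >= significance_threshold(root_total, fraction);
}
```
]]></programlisting>

<para>Under this model a node with 458,839 blocks is significant in a profile
with 5,943,417 blocks in total, while a node with 5,175 blocks is not.</para>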
490 <para>Finally, the bottom of the page shows a legend that explains some of the
491 terms, abbreviations and symbols used in the output.</para>
496 <sect2 id="dh-sort-metrics"><title>Sort Metrics</title>
498 <para>The order in which sub-trees are sorted can be changed via the "Sort
499 metric" drop-down menu at the top of DHAT's viewer. Different sort metrics can
500 be useful for finding different things. Some sort metrics also incorporate some
filtering, so that only nodes meeting particular criteria are shown.</para>
503 <!-- start of xi:include in the manpage -->
507 <term>Total (bytes)</term>
508 <listitem><para>The total number of bytes allocated during the execution.
509 Highly useful for evaluating heap churn, though not quite as useful as
515 <term>Total (blocks)</term>
516 <listitem><para>The total number of blocks allocated during the execution.
517 Highly useful for evaluating heap churn; reducing the number of calls to
518 the allocator can significantly speed up a program. This is the default
524 <term>Total (blocks), tiny</term>
525 <listitem><para>Like "Total (blocks)", but shows only very small blocks.
526 Moderately useful, because such blocks are often easy to avoid allocating.
531 <term>Total (blocks), short-lived</term>
532 <listitem><para>Like "Total (blocks)", but shows only very short-lived
533 blocks. Moderately useful, because such blocks are often easy to avoid
539 <term>Total (bytes), zero reads or zero writes</term>
540 <listitem><para>Like "Total (bytes)", but shows only blocks that are
541 never read or never written to (or both). Highly useful, because such
542 blocks indicate poor use of memory and are often easy to avoid allocating.
543 For example, sometimes a block is allocated and written to but then only
544 read if a condition C is true; in that case, it may be possible to delay
545 creating the block until condition C is true. Alternatively, sometimes
546 blocks are created and never used; such blocks are trivial to remove.
551 <term>Total (blocks), zero reads or zero writes</term>
552 <listitem><para>Like "Total (bytes), zero reads or zero writes" but for
553 blocks. Highly useful.
558 <term>Total (bytes), low-access</term>
559 <listitem><para>Like "Total (bytes)", but shows only blocks that have low
560 numbers of reads or low numbers of writes (or both). Moderately useful,
561 because such blocks indicate poor use of memory.
566 <term>Total (blocks), low-access</term>
567 <listitem><para>Like "Total (bytes), low-access", but for blocks.
572 <term>At t-gmax (bytes)</term>
573 <listitem><para>This shows the breakdown of memory at the point of peak
574 heap memory usage. Highly useful for reducing peak memory usage.
579 <term>At t-end (bytes)</term>
580 <listitem><para>This shows the breakdown of memory at program termination.
581 Highly useful for identifying process-lifetime leaks.
586 <term>Reads (bytes)</term>
587 <listitem><para>The number of bytes read within heap blocks. Occasionally
593 <term>Reads (bytes), high-access</term>
594 <listitem><para>Like "Reads (bytes)", but only shows blocks with high read
595 ratios. Occasionally useful for identifying hot areas of memory.
600 <term>Writes (bytes)</term>
601 <listitem><para>Like "Reads (bytes)", but for writes. Occasionally useful.
606 <term>Writes (bytes), high-access</term>
607 <listitem><para>Like "Reads (bytes), high-access", but for writes.
614 <para>The values within a node that represent the chosen sort metric are shown
615 in bold, so they stand out.</para>
617 <para>Here is part of a PP node found with "Total (blocks), tiny", showing
618 blocks with an average size of only 8.67 bytes:</para>
620 <programlisting><![CDATA[
621 Total: 3,407,848 bytes (0.25%, 169.62/Minstr) in 393,214 blocks (6.62%, 19.57/Minstr), avg size 8.67 bytes, avg lifetime 1,167,795,629.1 instrs (5.81% of program duration)
624 <para>Here is part of a PP node found with "Total (blocks), short-lived",
625 showing blocks with an average lifetime of only 181.75 instructions:</para>
627 <programlisting><![CDATA[
628 Total: 23,068,584 bytes (1.7%, 1,148.19/Minstr) in 262,143 blocks (4.41%, 13.05/Minstr), avg size 88 bytes, avg lifetime 181.75 instrs (0% of program duration)
631 <para>Here is an example of a PP identified with "Total (blocks), zero reads
632 or zero writes", showing blocks that are allocated but never touched:</para>
634 <programlisting><![CDATA[
635 Total: 7,339,920 bytes (0.54%, 365.33/Minstr) in 262,140 blocks (4.41%, 13.05/Minstr), avg size 28 bytes, avg lifetime 1,141,103,997.69 instrs (5.68% of program duration)
636 Max: 3,669,960 bytes in 131,070 blocks, avg size 28 bytes
637 At t-gmax: 3,336,400 bytes (0.79%) in 119,157 blocks (7.56%), avg size 28 bytes
638 At t-end: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
639 Reads: 0 bytes (0%, 0/Minstr), 0/byte
640 Writes: 0 bytes (0%, 0/Minstr), 0/byte
643 <para>All the blocks identified by these PPs are good candidates for
651 <sect1 id="dh-manual.realloc" xreflabel="Treatment of realloc">
652 <title>Treatment of realloc</title>
654 <para><computeroutput>realloc</computeroutput> is a tricky function and there
655 are several different ways that DHAT could handle it.</para>
657 <para>Imagine a <computeroutput>malloc(100)</computeroutput> call followed by
658 a <computeroutput>realloc(200)</computeroutput> call. This combination is
659 considered to add two to the total block count, and 300 bytes to the total
660 bytes count. (An alternative would be to only add one to the total block
661 count, and 200 bytes to the total bytes count, as if a single
662 <computeroutput>malloc(200)</computeroutput> call had occurred. While this
663 would be defensible from a semantic point of view, it is silly from an
664 operational point of view, because making two calls to allocator functions is
665 more expensive than one call, and DHAT is a profiler that aims to help with
666 runtime costs.)</para>
668 <para>Furthermore, the implicit copying of the 100 bytes is added to the reads
669 and writes counts. Without this, the read and write counts would be
670 under-measured and misleading.</para>
672 <para>However, DHAT only increases the current heap size by 100 bytes for this
673 combination, and does not change the current block count. (As opposed to
674 increasing the current heap size by 200 bytes and then decreasing it by 100
675 bytes.) As a result, it can only increase the global heap peak (if indeed,
676 this results in a new peak) by 100 bytes.</para>
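<para>The accounting rules described so far can be summarised in a small
model. This is a sketch of the rules as stated in this section, not DHAT's
actual data structures or code; the struct and function names are hypothetical,
and the handling of a shrinking <computeroutput>realloc</computeroutput>
(copying only the smaller of the two sizes) is an assumption.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters mirroring the quantities discussed above. */
struct HeapStats {
    size_t total_blocks;   /* blocks ever allocated */
    size_t total_bytes;    /* bytes ever allocated */
    size_t curr_blocks;    /* blocks alive now */
    size_t curr_bytes;     /* bytes alive now */
    size_t peak_bytes;     /* global maximum of curr_bytes */
    size_t read_bytes;
    size_t write_bytes;
};

static void update_peak(struct HeapStats *s)
{
    if (s->curr_bytes > s->peak_bytes) {
        s->peak_bytes = s->curr_bytes;
    }
}

void model_malloc(struct HeapStats *s, size_t size)
{
    s->total_blocks += 1;
    s->total_bytes  += size;
    s->curr_blocks  += 1;
    s->curr_bytes   += size;
    update_peak(s);
}

void model_realloc(struct HeapStats *s, size_t old_size, size_t new_size)
{
    /* Assumption: only the bytes that survive the resize are copied. */
    size_t copied = old_size < new_size ? old_size : new_size;

    assert(s->curr_bytes >= old_size);

    /* Counted as a fresh allocation in the totals. */
    s->total_blocks += 1;
    s->total_bytes  += new_size;

    /* The live heap changes only by the size difference, and the live
       block count does not change. */
    s->curr_bytes = s->curr_bytes - old_size + new_size;
    update_peak(s);

    /* The implicit copy is counted as both reads and writes. */
    s->read_bytes  += copied;
    s->write_bytes += copied;
}
```
]]></programlisting>

<para>Applying <computeroutput>model_malloc</computeroutput> with 100 bytes and
then <computeroutput>model_realloc</computeroutput> from 100 to 200 bytes to a
zeroed <computeroutput>HeapStats</computeroutput> gives two total blocks, 300
total bytes, one live block, 200 live bytes, a peak of 200 bytes, and 100 bytes
each of reads and writes.</para>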
678 <para>Finally, the program point assigned to the block allocated by the
679 <computeroutput>malloc(100)</computeroutput> call is retained once the block
is reallocated. This means that all 300 bytes are attributed to that
681 program point, and no separate program point is created for the
682 <computeroutput>realloc(200)</computeroutput> call. This may be surprising,
683 but it has one large benefit.</para>
685 <para>Imagine some code that starts with an empty buffer, and then gradually
686 adds data to that buffer from numerous different points in the code,
687 reallocating the buffer each time it gets full. (E.g. code generation in a
688 compiler might work this way.) With the described approach, the first heap
689 block and all subsequent heap blocks are attributed to the same program point.
690 While this is something of a lie -- the first program point isn't actually
691 responsible for the other allocations -- it is arguably better than having the
692 program points spread around in a distribution that unpredictably depends on
when the reallocations were triggered.</para>
698 <sect1 id="dh-manual.copy-profiling" xreflabel="Copy profiling">
699 <title>Copy profiling</title>
701 <para>If DHAT is invoked with <option>--mode=copy</option>, instead of
702 profiling heap operations (allocations and deallocations), it profiles copy
703 operations, such as <computeroutput>memcpy</computeroutput>,
704 <computeroutput>memmove</computeroutput>,
705 <computeroutput>strcpy</computeroutput>, and
706 <computeroutput>bcopy</computeroutput>. This is sometimes useful.</para>
708 <para>Here is an example PP node from this mode:</para>
710 <programlisting><![CDATA[
711 PP 1.1.2/5 (4 children) {
712 Total: 1,210,925 bytes (10.03%, 4,358.66/Minstr) in 112,717 blocks (35.2%, 405.72/Minstr), avg size 10.74 bytes
714 ^1: 0x4842524: memmove (vg_replace_strmem.c:1289)
715 #2: 0x1F0A0D: copy_nonoverlapping<u8> (intrinsics.rs:1858)
716 #3: 0x1F0A0D: copy_from_slice<u8> (mod.rs:2524)
717 #4: 0x1F0A0D: spec_extend<u8> (vec.rs:2227)
718 #5: 0x1F0A0D: extend_from_slice<u8> (vec.rs:1619)
719 #6: 0x1F0A0D: push_str (string.rs:821)
720 #7: 0x1F0A0D: write_str (string.rs:2418)
721 #8: 0x1F0A0D: <&mut W as core::fmt::Write>::write_str (mod.rs:195)
726 <para>It is very similar to the PP nodes for heap profiling, but with less
727 information, because copy profiling doesn't involve any tracking of memory
728 regions with lifetimes.</para>
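<para>Here is a minimal sketch of a workload that could be examined with
<option>--mode=copy</option>. The function name
<computeroutput>copy_workload</computeroutput> and the constants are
hypothetical. The sketch assumes that copies are observed through the
<computeroutput>memcpy</computeroutput> family of library functions, as the
stack trace above suggests, so a copy that the compiler expands inline might
not be counted.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK = 64, ROUNDS = 1000 };

/* Copies CHUNK bytes from src to dst ROUNDS times and returns the number
   of bytes this function asked memcpy to copy. The length is read from a
   volatile variable so that the compiler is less likely to replace the
   memcpy call with inline code. */
size_t copy_workload(char *dst, const char *src)
{
    volatile size_t len = CHUNK;
    size_t requested = 0;

    assert(dst != NULL && src != NULL);

    for (int i = 0; i < ROUNDS; i++) {
        size_t n = len;
        memcpy(dst, src, n);
        requested += n;
    }
    return requested;
}
```
]]></programlisting>

<para>A program that calls <computeroutput>copy_workload</computeroutput> on
two 64-byte buffers requests 64,000 bytes of copying in 1,000 calls, which
gives a known quantity to look for in a PP whose stack trace includes
<computeroutput>copy_workload</computeroutput>.</para>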
733 <sect1 id="dh-manual.ad-hoc-profiling" xreflabel="Ad hoc profiling">
734 <title>Ad hoc profiling</title>
736 <para>If DHAT is invoked with <option>--mode=ad-hoc</option>, instead of
737 profiling heap operations (allocations and deallocations), it profiles calls to
738 the <computeroutput>DHAT_AD_HOC_EVENT</computeroutput> client request, which is
739 declared in <filename>dhat/dhat.h</filename>.</para>
741 <para>Here is an example PP node from this mode:</para>
743 <programlisting><![CDATA[
745 Total: 30 units (17.65%, 115.97/Minstr) in 1 events (14.29%, 3.87/Minstr), avg size 30 units
747 ^1: 0x109407: g (ad-hoc.c:4)
748 ^2: 0x109425: f (ad-hoc.c:8)
749 #3: 0x109497: main (ad-hoc.c:14)
754 <para>This kind of profiling is useful when you know a code path is hot but you
755 want to know more about it.</para>
757 <para>For example, you might want to know which callsites of a hot function
758 account for most of the calls. You could put a
759 <computeroutput>DHAT_AD_HOC_EVENT(1);</computeroutput> call at the start of
760 that function.</para>
762 <para>Alternatively, you might want to know the typical length of a vector in a
763 hot location. You could put a
764 <computeroutput>DHAT_AD_HOC_EVENT(len);</computeroutput> call at the
appropriate location, where <computeroutput>len</computeroutput> is the length
766 of the vector.</para>
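<para>The second case might look like the following sketch. The function
<computeroutput>sum_vector</computeroutput> is a hypothetical hot function. The
include path <filename>valgrind/dhat.h</filename> is an assumption about where
the header is installed; when the header cannot be found, the sketch falls
back to a no-op definition so that it still builds.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the installed header is <valgrind/dhat.h>. If it is not
   available, define the macro as a no-op so the example still builds. */
#if defined(__has_include)
#  if __has_include(<valgrind/dhat.h>)
#    include <valgrind/dhat.h>
#  endif
#endif
#ifndef DHAT_AD_HOC_EVENT
#  define DHAT_AD_HOC_EVENT(weight) ((void)(weight))
#endif

/* Hypothetical hot function: sums a vector of ints. */
long sum_vector(const int *data, size_t len)
{
    long sum = 0;

    assert(data != NULL || len == 0);

    /* One event per call, weighted by the vector length, so that the
       profile records both how often each call site is reached and how
       much data it passes in. */
    DHAT_AD_HOC_EVENT(len);

    for (size_t i = 0; i < len; i++) {
        sum += data[i];
    }
    return sum;
}
```
]]></programlisting>

<para>When such a program is run with <option>--mode=ad-hoc</option>, each call
site of <computeroutput>sum_vector</computeroutput> should get its own PP, with
the number of events giving the call count and the number of units giving the
summed lengths.</para>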
771 <sect1 id="dh-manual.options" xreflabel="DHAT Command-line Options">
772 <title>DHAT Command-line Options</title>
774 <para>DHAT-specific command-line options are:</para>
776 <!-- start of xi:include in the manpage -->
777 <variablelist id="dh.opts.list">
779 <varlistentry id="opt.dhat-out-file" xreflabel="--dhat-out-file">
781 <option><![CDATA[--dhat-out-file=<file> ]]></option>
784 <para>Write the profile data to
785 <computeroutput>file</computeroutput> rather than to the default
787 <filename>dhat.out.<pid></filename>. The
788 <option>%p</option> and <option>%q</option> format specifiers
789 can be used to embed the process ID and/or the contents of an
790 environment variable in the name, as is the case for the core
791 option <option><link linkend="opt.log-file">--log-file</link></option>.
796 <varlistentry id="opt.mode" xreflabel="--mode">
798 <option><![CDATA[--mode=<heap|copy|ad-hoc> [default: heap] ]]></option>
801 <para>The profiling mode: heap profiling, copy profiling, or ad hoc
809 <para>Note that stacks by default have 12 frames. This may be more than
810 necessary, in which case the <option>--num-callers</option> flag can be used to
811 reduce the number, which may make DHAT run slightly faster.
814 <!-- end of xi:include in the manpage -->