cachegrind/docs/cg-manual.xml

   1 <?xml version="1.0"?> <!-- -*- sgml -*- -->
   2 <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
   3   "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"
   4 [ <!ENTITY % vg-entities SYSTEM "../../docs/xml/vg-entities.xml"> %vg-entities; ]>
   5
   6 <!-- Referenced from both the manual and manpage -->
   7 <chapter id="&vg-cg-manual-id;" xreflabel="&vg-cg-manual-label;">
   8 <title>Cachegrind: a high-precision tracing profiler</title>
   9
  10 <para>
  11 To use this tool, specify <option>--tool=cachegrind</option> on the Valgrind
  12 command line.
  13 </para>
  14
  15 <sect1 id="cg-manual.overview" xreflabel="Overview">
  16 <title>Overview</title>
  17
  18 <para>
  19 Cachegrind is a high-precision tracing profiler. It runs slowly, but collects
  20 precise and reproducible profiling data. It can merge and diff data from
  21 different runs. To expand on these characteristics:
  22 </para>
  23
  24 <itemizedlist>
  25   <listitem>
  26     <para>
  27     <emphasis>Precise.</emphasis> Cachegrind measures the exact number of
  28     instructions executed by your program, not an approximation. Furthermore,
  29     it presents the gathered data at the file, function, and line level. This
  30     is different to many other profilers that measure approximate execution
  31     time, using sampling, and only at the function level.
  32     </para>
  33   </listitem>
  34
  35   <listitem>
  36     <para>
  37     <emphasis>Reproducible.</emphasis> In general, execution time is a better
  38     metric than instruction counts because it's what users perceive. However,
  39     execution time often has high variability. When running the exact same
  40     program on the exact same input multiple times, execution time might vary
  41     by several percent. Furthermore, small changes in a program can change its
  42     memory layout and have even larger effects on runtime. In contrast,
  43     instruction counts are highly reproducible; for some programs they are
  44     perfectly reproducible. This means the effects of small changes in a
  45     program can be measured with high precision.
  46     </para>
  47   </listitem>
  48 </itemizedlist>
  49
  50 <para>
  51 For these reasons, Cachegrind is an excellent complement to time-based profilers.
  52 </para>
  53
  54 <para>
  55 Cachegrind can annotate programs written in any language, so long as debug info
  56 is present to map machine code back to the original source code. Cachegrind has
  57 been used successfully on programs written in C, C++, Rust, and assembly.
  58 </para>
  59
  60 <para>
  61 Cachegrind can also simulate how your program interacts with a machine's cache
  62 hierarchy and branch predictor. This simulation was the original motivation for
  63 the tool, hence its name. However, the simulations are basic and unlikely to
  64 reflect the behaviour of a modern machine. For this reason they are off by
  65 default. If you really want cache and branch information, a profiler like
  66 <computeroutput>perf</computeroutput> that accesses hardware counters is a
  67 better choice.
  68 </para>
  69
  70 </sect1>
  71
  72
  73 <sect1 id="cg-manual.profile"
  74        xreflabel="Using Cachegrind and cg_annotate">
  75 <title>Using Cachegrind and cg_annotate</title>
  76
  77 <para>
  78 First, as for normal Valgrind use, you should compile with debugging info (the
  79 <option>-g</option> option in most compilers). But by contrast with normal
  80 Valgrind use, you probably do want to turn optimisation on, since you should
  81 profile your program as it will be normally run.
  82 </para>
  83
  84 <para>
  85 Second, run Cachegrind itself to gather the profiling data.
  86 </para>
  87
  88 <para>
  89 Third, run cg_annotate to get a detailed presentation of that data. cg_annotate
  90 can combine the results of multiple Cachegrind output files. It can also
  91 perform a diff between two Cachegrind output files.
  92 </para>
  93
  94
  95 <sect2 id="cg-manual.running-cachegrind" xreflabel="Running Cachegrind">
  96 <title>Running Cachegrind</title>
  97
  98 <para>
  99 To run Cachegrind on a program <filename>prog</filename>, run:
 100 <screen><![CDATA[
 101 valgrind --tool=cachegrind prog
 102 ]]></screen>
 103 </para>
 104
 105 <para>
 106 The program will execute (slowly). Upon completion, summary statistics that
 107 look like this will be printed:
 108 </para>
 109
 110 <programlisting><![CDATA[
 111 ==17942== I refs:          8,195,070
 112 ]]></programlisting>
 113
 114 <para>
 115 The <computeroutput>I refs</computeroutput> number is short for "Instruction
 116 cache references", which is equivalent to "instructions executed". If you
 117 enable the cache and/or branch simulation, additional counts will be shown.
 118 </para>
 119
 120 </sect2>
 121
 122
 123 <sect2 id="cg-manual.outputfile" xreflabel="Output File">
 124 <title>Output File</title>
 125
 126 <para>
 127 Cachegrind also writes more detailed profiling data to a file. By default this
 128 Cachegrind output file is named <filename>cachegrind.out.&lt;pid&gt;</filename>
 129 (where <filename>&lt;pid&gt;</filename> is the program's process ID), but its
 130 name can be changed with the <option>--cachegrind-out-file</option> option.
 131 This file is human-readable, but is intended to be interpreted by the
 132 accompanying program cg_annotate, described in the next section.
 133 </para>
 134
 135 <para>
 136 The default <computeroutput>.&lt;pid&gt;</computeroutput> suffix on the output
 137 file name serves two purposes. First, it means existing Cachegrind output files
 138 aren't immediately overwritten. Second, and more importantly, it allows correct
 139 profiling with the <option>--trace-children=yes</option> option of programs
 140 that spawn child processes.
 141 </para>
 142
 143 </sect2>
 144
 145
 146 <sect2 id="cg-manual.running-cg_annotate" xreflabel="Running cg_annotate">
 147 <title>Running cg_annotate</title>
 148
 149 <para>
 150 Before using cg_annotate, it is worth widening your window to be at least 120
 151 characters wide if possible, because the output lines can be quite long.
 152 </para>
 153
 154 <para>
 155 Then run:
 156 <screen>cg_annotate &lt;filename&gt;</screen>
 157 on a Cachegrind output file.
 158 </para>
 159
 160 </sect2>
 161
 162 <!--
 163 To produce the sample data, I did the following. Note that the single hyphens in
 164 the valgrind command should be double hyphens, but XML doesn't allow double
 165 hyphens in comments.
 166
 167   gcc -g -O concord.c -o concord
 168   valgrind -tool=cachegrind -cachegrind-out-file=concord.cgout ./concord ../cg_main.c
 169   (to exit, type `q` and hit enter)
 170   python ../cg_annotate concord.cgout > concord.cgann
 171
 172 concord.c is a small C program I wrote at university. It's a good size for an example.
 173 -->
 174
 175 <sect2 id="cg-manual.the-metadata" xreflabel="The Metadata Section">
 176 <title>The Metadata Section</title>
 177
 178 <para>
 179 The first part of the output looks like this:
 180 </para>
 181
 182 <programlisting><![CDATA[
 183 --------------------------------------------------------------------------------
 184 -- Metadata
 185 --------------------------------------------------------------------------------
 186 Invocation:       ../cg_annotate concord.cgout
 187 Command:          ./concord ../cg_main.c
 188 Events recorded:  Ir
 189 Events shown:     Ir
 190 Event sort order: Ir
 191 Threshold:        0.1%
 192 Annotation:       on
 193 ]]></programlisting>
 194
 195 <para>
 196 It summarizes how Cachegrind and the profiled program were run.
 197 </para>
 198
 199 <itemizedlist>
 200   <listitem>
 201     <para>
 202     Invocation: the command line used to produce this output.
 203     </para>
 204   </listitem>
 205
 206   <listitem>
 207     <para>
 208     Command: the command line used to run the profiled program.
 209     </para>
 210   </listitem>
 211
 212   <listitem>
 213     <para>
 214     Events recorded: which events were recorded. By default, this is
 215     <computeroutput>Ir</computeroutput>. More events will be recorded if cache
 216     and/or branch simulation is enabled.
 217     </para>
 218   </listitem>
 219
 220   <listitem>
 221     <para>
 222     Events shown: the events shown, which are a subset of the events recorded.
 223     This can be adjusted with the <option>--show</option> option.
 224     </para>
 225   </listitem>
 226
 227   <listitem>
 228     <para>
 229     Event sort order: the sort order used for the subsequent sections. For
 230     example, in this case those sections are sorted from highest
 231     <computeroutput>Ir</computeroutput> counts to lowest. If there are multiple
 232     events, one will be the primary sort event, and then there can be a
 233     secondary sort event, tertiary sort event, etc., though more than one is
 234     rarely needed. This order can be adjusted with the <option>--sort</option>
 235     option. Note that this does <emphasis>not</emphasis> specify the order in
 236     which the columns appear. That is specified by the "events shown" line (and
 237     can be changed with the <option>--show</option> option).
 238     </para>
 239   </listitem>
 240
 241   <listitem>
 242     <para>
 243     Threshold: cg_annotate omits files and functions with very low
 244     counts to keep the output size reasonable. By default cg_annotate only
 245     shows files and functions that account for at least 0.1% of the primary
 246     sort event. The threshold can be adjusted with the
 247     <option>--threshold</option> option.
 248     </para>
 249   </listitem>
 250
 251   <listitem>
 252     <para>
 253     Annotation: whether source file annotation is enabled. Controlled with the
 254     <option>--annotate</option> option.
 255     </para>
 256   </listitem>
 257
 258 </itemizedlist>
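<para>
As a rough illustration of the threshold rule, the following Python sketch
filters a few function counts against a 0.1% cut-off. It is not cg_annotate's
actual code. The first two counts are taken from the sample output used in this
chapter; the <function>rarely_called</function> entry and the helper name
<function>above_threshold</function> are hypothetical.
</para>

<programlisting><![CDATA[
```python
# Sketch: keep only entries whose share of the total meets the threshold.
TOTAL_IR = 8_195_070  # PROGRAM TOTALS from the sample output

counts = {
    "get_word": 1_630_232,
    "new_word_node": 46_676,
    "rarely_called": 5_000,  # hypothetical entry, not in the sample output
}


def above_threshold(counts, total, threshold_pct=0.1):
    """Return the names whose count is at least threshold_pct% of total."""
    cutoff = total * threshold_pct / 100
    return [name for name, n in counts.items() if n >= cutoff]


shown = above_threshold(counts, TOTAL_IR)
print(shown)
```
]]></programlisting>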
 259
 260 <para>
 261 If cache simulation is enabled, details of the cache parameters will be shown
 262 above the "Invocation" line.
 263 </para>
 264
 265 </sect2>
 266
 267
 268 <sect2 id="cg-manual.the-global"
 269        xreflabel="Global, File, and Function-level Counts">
 270 <title>Global, File, and Function-level Counts</title>
 271
 272 <para>
 273 Next comes the summary for the whole program:
 274 </para>
 275
 276 <programlisting><![CDATA[
 277 --------------------------------------------------------------------------------
 278 -- Summary
 279 --------------------------------------------------------------------------------
 280 Ir________________
 281
 282 8,195,070 (100.0%)  PROGRAM TOTALS
 283 ]]></programlisting>
 284
 285 <para>
 286 The <computeroutput>Ir</computeroutput> column label is suffixed with
 287 underscores to show the bounds of the columns underneath.
 288 </para>
 289
 290 <para>
 291 Then come the file:function counts. Here is the first part of that section:
 292 </para>
 293
 294 <programlisting><![CDATA[
 295 --------------------------------------------------------------------------------
 296 -- File:function summary
 297 --------------------------------------------------------------------------------
 298   Ir______________________  file:function
 299
 300 < 3,078,746 (37.6%, 37.6%)  /home/njn/grind/ws1/cachegrind/concord.c:
 301   1,630,232 (19.9%)           get_word
 302     630,918  (7.7%)           hash
 303     461,095  (5.6%)           insert
 304     130,560  (1.6%)           add_existing
 305      91,014  (1.1%)           init_hash_table
 306      88,056  (1.1%)           create
 307      46,676  (0.6%)           new_word_node
 308
 309 < 1,746,038 (21.3%, 58.9%)  ./malloc/./malloc/malloc.c:
 310   1,285,938 (15.7%)           _int_malloc
 311     458,225  (5.6%)           malloc
 312
 313 < 1,107,550 (13.5%, 72.4%)  ./libio/./libio/getc.c:getc
 314
 315 <   551,071  (6.7%, 79.1%)  ./string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2
 316
 317 <   521,228  (6.4%, 85.5%)  ./ctype/../include/ctype.h:
 318     260,616  (3.2%)           __ctype_tolower_loc
 319     260,612  (3.2%)           __ctype_b_loc
 320
 321 <   468,163  (5.7%, 91.2%)  ???:
 322     468,151  (5.7%)           ???
 323
 324 <   456,071  (5.6%, 96.8%)  /usr/include/ctype.h:get_word
 325
 326 ]]></programlisting>
 327
 328 <para>
 329 Each entry covers one file, and one or more functions within that file. If
 330 there is only one significant function within a file, as in the third entry,
 331 the file and function are shown on the same line, separated by a colon. If there
 332 are multiple significant functions within a file, as in the first entry, each
 333 function gets its own line.
 334 </para>
 335
 336 <para>
 337 This example involves a small C program, and shows a combination of code from
 338 the program itself (including functions like <function>get_word</function> and
 339 <function>hash</function> in the file <filename>concord.c</filename>) as well
 340 as code from system libraries, such as functions like
 341 <function>malloc</function> and <function>getc</function>.
 342 </para>
 343
 344 <para>
 345 Each entry is preceded with a <computeroutput>&lt;</computeroutput>, which can
 346 be useful when navigating through the output in an editor, or grepping through
 347 results.
 348 </para>
 349
 350 <para>
 351 The first percentage in each column indicates the proportion of the total event
 352 count that is covered by this line. The second percentage, which only shows on the
 353 first line of each entry, shows the cumulative percentage of all the entries up
 354 to and including this one. The entries shown here account for 96.8% of the
 355 instructions executed by the program.
 356 </para>
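<para>
The two percentages can be reproduced by hand. The Python sketch below uses the
first three file totals from the sample output above and recomputes each
entry's own percentage and the running cumulative percentage. The helper name
<function>percentages</function> and the shortened file labels are assumptions
made for illustration.
</para>

<programlisting><![CDATA[
```python
# Sketch: recompute the per-entry and cumulative percentages.
TOTAL_IR = 8_195_070  # PROGRAM TOTALS from the sample output

entries = [
    ("concord.c", 3_078_746),
    ("malloc.c", 1_746_038),
    ("getc.c", 1_107_550),
]


def percentages(entries, total):
    """Return (name, own %, cumulative %) tuples, rounded to one decimal."""
    rows = []
    running = 0
    for name, count in entries:
        running += count
        own = round(100 * count / total, 1)
        cumulative = round(100 * running / total, 1)
        rows.append((name, own, cumulative))
    return rows


rows = percentages(entries, TOTAL_IR)
for name, own, cumulative in rows:
    print(f"{name}: {own}%, {cumulative}%")
```
]]></programlisting>

<para>
The results agree with the first three entries of the sample output: 37.6%,
then 21.3% (58.9% cumulative), then 13.5% (72.4% cumulative).
</para>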
 357
 358 <para>
 359 The name <computeroutput>???</computeroutput> is used if the file name and/or
 360 function name could not be determined from debugging information. If
 361 <filename>???</filename> filenames dominate, the program probably wasn't
 362 compiled with <option>-g</option>. If <function>???</function> function names
 363 dominate, the program may have had symbols stripped.
 364 </para>
 365
 366 <para>
 367 After that come the function:file counts. Here is the first part of that section:
 368 </para>
 369
 370 <programlisting><![CDATA[
 371 --------------------------------------------------------------------------------
 372 -- Function:file summary
 373 --------------------------------------------------------------------------------
 374   Ir______________________  function:file
 375
 376 > 2,086,303 (25.5%, 25.5%)  get_word:
 377   1,630,232 (19.9%)           /home/njn/grind/ws1/cachegrind/concord.c
 378     456,071  (5.6%)           /usr/include/ctype.h
 379
 380 > 1,285,938 (15.7%, 41.1%)  _int_malloc:./malloc/./malloc/malloc.c
 381
 382 > 1,107,550 (13.5%, 54.7%)  getc:./libio/./libio/getc.c
 383
 384 >   630,918  (7.7%, 62.4%)  hash:/home/njn/grind/ws1/cachegrind/concord.c
 385
 386 >   551,071  (6.7%, 69.1%)  __strcmp_avx2:./string/../sysdeps/x86_64/multiarch/strcmp-avx2.S
 387
 388 >   480,248  (5.9%, 74.9%)  malloc:
 389     458,225  (5.6%)           ./malloc/./malloc/malloc.c
 390      22,023  (0.3%)           ./malloc/./malloc/arena.c
 391
 392 >   468,151  (5.7%, 80.7%)  ???:???
 393
 394 >   461,095  (5.6%, 86.3%)  insert:/home/njn/grind/ws1/cachegrind/concord.c
 395 ]]></programlisting>
 396
 397 <para>
 398 This is similar to the previous section, but is grouped by functions first and
 399 files second. Also, the entry markers are <computeroutput>&gt;</computeroutput>
 400 instead of <computeroutput>&lt;</computeroutput>.
 401 </para>
 402
 403 <para>
 404 You might wonder why this section is needed, and how it differs from the
 405 previous section. The answer is inlining. In this example there are two entries
 406 demonstrating a function whose code is effectively spread across more than one
 407 file: <function>get_word</function> and <function>malloc</function>. Here is an
 408 example from profiling the Rust compiler, a much larger program that uses
 409 inlining more:
 410 </para>
 411
 412 <programlisting><![CDATA[
 413 >  30,469,230 (1.3%, 11.1%)  <rustc_middle::ty::context::CtxtInterners>::intern_ty:
 414    10,269,220 (0.5%)           /home/njn/.cargo/registry/src/github.com-1ecc6299db9ec823/hashbrown-0.12.3/src/raw/mod.rs
 415     7,696,827 (0.3%)           /home/njn/dev/rust0/compiler/rustc_middle/src/ty/context.rs
 416     3,858,099 (0.2%)           /home/njn/dev/rust0/library/core/src/cell.rs
 417 ]]></programlisting>
 418
 419 <para>
 420 In this case the compiled function <function>intern_ty</function> includes code
 421 from three different source files, due to inlining. These should be examined
 422 together. Older versions of cg_annotate presented this entry as three separate
 423 file:function entries, which would typically be intermixed with all the other
 424 entries, making it hard to see that they are all really part of the same
 425 function.
 426 </para>
 427
 428 </sect2>
 429
 430
 431 <sect2 id="cg-manual.line-by-line" xreflabel="Per-line Counts">
 432 <title>Per-line Counts</title>
 433
 434 <para>
 435 By default, a source file is annotated if it contains at least one function
 436 that meets the significance threshold. This can be disabled with the
 437 <option>--annotate</option> option.
 438 </para>
 439
 440 <para>
 441 To continue the previous example, here is part of the annotation of the file
 442 <filename>concord.c</filename>:
 443 </para>
 444
 445 <programlisting><![CDATA[
 446 --------------------------------------------------------------------------------
 447 -- Annotated source file: /home/njn/grind/ws1/cachegrind/docs/concord.c
 448 --------------------------------------------------------------------------------
 449 Ir____________
 450
 451       .         /* Function builds the hash table from the given file. */
 452       .         void init_hash_table(char *file_name, Word_Node *table[])
 453       8 (0.0%)  {
 454       .             FILE *file_ptr;
 455       .             Word_Info *data;
 456       2 (0.0%)      int line = 1, i;
 457       .
 458       .             /* Structure used when reading in words and line numbers. */
 459       3 (0.0%)      data = (Word_Info *) create(sizeof(Word_Info));
 460       .
 461       .             /* Initialise entire table to NULL. */
 462   2,993 (0.0%)      for (i = 0; i < TABLE_SIZE; i++)
 463     997 (0.0%)          table[i] = NULL;
 464       .
 465       .             /* Open file, check it. */
 466       4 (0.0%)      file_ptr = fopen(file_name, "r");
 467       2 (0.0%)      if (!(file_ptr)) {
 468       .                 fprintf(stderr, "Couldn't open '%s'.\n", file_name);
 469       .                 exit(EXIT_FAILURE);
 470       .             }
 471       .
 472       .             /*  'Get' the words and lines one at a time from the file, and insert them
 473       .             ** into the table one at a time. */
 474  55,363 (0.7%)      while ((line = get_word(data, line, file_ptr)) != EOF)
 475  31,632 (0.4%)          insert(data->word, data->line, table);
 476       .
 477       2 (0.0%)      free(data);
 478       2 (0.0%)      fclose(file_ptr);
 479       6 (0.0%)  }
 480 ]]></programlisting>
 481
 482 <para>
 483 Each executed line is annotated with its event counts. Other lines are
 484 annotated with a dot. This may be because they contain no executable code, or
 485 they contain executable code but were never executed.
 486 </para>
 487
 488 <para>
 489 You can easily tell if a function is inlined from this output. If it is not
 490 inlined, it will have event counts on the lines containing the opening and
 491 closing braces. If it is inlined, it will not have event counts on those lines.
 492 In the example above, <function>init_hash_table</function> does have counts,
 493 so you can tell it is not inlined.
 494 </para>
 495
 496 <para>
 497 Note again that inlining can lead to surprising results. If a function
 498 <function>f</function> is always inlined, in the file:function and
 499 function:file sections counts will be attributed to the functions it is inlined
 500 into, rather than itself. However, if you look at the line-by-line annotations
 501 for <function>f</function> you'll see the counts that belong to
 502 <function>f</function>. So it's worth looking for large counts/percentages in the
 503 line-by-line annotations.
 504 </para>
 505
 506 <para>
 507 Sometimes only a small section of a source file is executed. To minimise
 508 uninteresting output, cg_annotate only shows annotated lines and lines within a
 509 small distance of annotated lines. Gaps are marked with line numbers, for
 510 example:
 511 </para>
 512
 513 <programlisting><![CDATA[
 514 (counts and code for line 704)
 515 -- line 375 ----------------------------------------
 516 -- line 514 ----------------------------------------
 517 (counts and code for line 878)
 518 ]]></programlisting>
 519
 520 <para>
 521 The number of lines of context shown around annotated lines is controlled by
 522 the <option>--context</option> option.
 523 </para>
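<para>
The effect of a context setting can be pictured with a small Python sketch. It
assumes a hypothetical helper, <function>shown_lines</function>, that takes the
annotated line numbers, the number of context lines, and the length of the
file; it only illustrates the idea and is not how cg_annotate is implemented.
</para>

<programlisting><![CDATA[
```python
# Sketch: which source lines are displayed for a given amount of context.
def shown_lines(annotated, context, last_line):
    """Return the sorted line numbers within `context` of an annotated line."""
    shown = set()
    for n in annotated:
        lo = max(1, n - context)
        hi = min(last_line, n + context)
        shown.update(range(lo, hi + 1))
    return sorted(shown)


# Two annotated lines far apart leave a gap between the displayed ranges.
lines = shown_lines([10, 30], context=2, last_line=40)
print(lines)
```
]]></programlisting>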
 524
 525 <para>
 526 Any significant source files that could not be found are shown like this:
 527 </para>
 528
 529 <programlisting><![CDATA[
 530 --------------------------------------------------------------------------------
 531 -- Annotated source file: ./malloc/./malloc/malloc.c
 532 --------------------------------------------------------------------------------
 533 Unannotated because one or more of these original files are unreadable:
 534 - ./malloc/./malloc/malloc.c
 535 ]]></programlisting>
 536
 537 <para>
 538 This is common for library files, because libraries are usually compiled with
 539 debugging information but the source files are rarely present on a system.
 540 </para>
 541
 542 <para>
 543 Cachegrind relies heavily on accurate debug info. Sometimes compilers
 544 map a particular compiled instruction to line number 0, where the 0 represents
 545 "unknown" or "none". This is annoying but does happen in practice. cg_annotate
 546 prints these in the following way:
 547 </para>
 548
 549 <programlisting><![CDATA[
 550 --------------------------------------------------------------------------------
 551 -- Annotated source file: /home/njn/dev/rust0/compiler/rustc_borrowck/src/lib.rs
 552 --------------------------------------------------------------------------------
 553 Ir______________
 554
 555 1,046,746 (0.0%)  <unknown (line 0)>
 556 ]]></programlisting>
 557
 558 <para>
 559 Finally, when annotation is performed, the output ends with a summary of how
 560 many counts were annotated and unannotated, and why. For example:
 561 </para>
 562
 563 <programlisting><![CDATA[
 564 --------------------------------------------------------------------------------
 565 -- Annotation summary
 566 --------------------------------------------------------------------------------
 567 Ir_______________
 568
 569 3,534,817 (43.1%)    annotated: files known & above threshold & readable, line numbers known
 570         0            annotated: files known & above threshold & readable, line numbers unknown
 571         0          unannotated: files known & above threshold & two or more non-identical
 572 4,132,126 (50.4%)  unannotated: files known & above threshold & unreadable
 573    59,950  (0.7%)  unannotated: files known & below threshold
 574   468,163  (5.7%)  unannotated: files unknown
 575 ]]></programlisting>
 576
 577 </sect2>
 578
 579
 580 <sect2 id="cg-manual.forkingprograms" xreflabel="Forking Programs">
 581 <title>Forking Programs</title>
 582
 583 <para>
 584 If your program forks, the child will inherit all the profiling data that
 585 has been gathered for the parent.
 586 </para>
 587
 588 <para>
 589 If the output file name (controlled by <option>--cachegrind-out-file</option>)
 590 does not contain <option>%p</option>, then the outputs from the parent and
 591 child will be intermingled in a single output file, which will almost certainly
 592 make it unreadable by cg_annotate.
 593 </para>
 594
 595 </sect2>
 596
 597
 598 <sect2 id="cg-manual.annopts.warnings" xreflabel="cg_annotate Warnings">
 599 <title>cg_annotate Warnings</title>
 600
 601 <para>
 602 There are two situations in which cg_annotate prints warnings.
 603 </para>
 604
 605 <itemizedlist>
 606   <listitem>
 607     <para>
 608     If a source file is more recent than the Cachegrind output file. This is
 609     because the information in the Cachegrind output file is only recorded with
 610     line numbers, so if the line numbers change at all in the source (e.g.
 611     lines added, deleted, swapped), any annotations will be incorrect.
 612     </para>
 613   </listitem>
 614   <listitem>
 615     <para>
 616     If information is recorded about line numbers past the end of a file. This
 617     can be caused by the above problem, e.g. shortening the source file while
 618     using an old Cachegrind output file. If this happens, the figures for the
 619     bogus lines are printed anyway (and clearly marked as bogus) in case they
 620     are important.
 621     </para>
 622   </listitem>
 623 </itemizedlist>
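<para>
The first situation can be checked by hand before running cg_annotate. The
sketch below compares file modification times in Python; how cg_annotate itself
performs the check is not described here, so treat the helper
<function>source_is_newer</function> and the file names as assumptions.
</para>

<programlisting><![CDATA[
```python
# Sketch: detect a source file that was modified after the profile was written.
import os
import tempfile


def source_is_newer(source_path, cgout_path):
    """True if the source file is more recent than the Cachegrind output file."""
    return os.path.getmtime(source_path) > os.path.getmtime(cgout_path)


with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "prog.c")       # hypothetical source file
    out = os.path.join(d, "prog.cgout")   # hypothetical Cachegrind output file
    for path in (out, src):
        open(path, "w").close()
    os.utime(out, (1_000, 1_000))  # pretend the profile was written first
    os.utime(src, (2_000, 2_000))  # and the source was edited afterwards
    stale = source_is_newer(src, out)

print(stale)
```
]]></programlisting>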
 624
 625 </sect2>
 626
 627
 628 <sect2 id="cg-manual.cg_merge" xreflabel="cg_merge">
 629 <title>Merging Cachegrind Output Files</title>
 630
 631 <para>
 632 cg_annotate can merge data from multiple Cachegrind output files in a single
 633 run. (There is also a program called cg_merge that can merge multiple
 634 Cachegrind output files into a single Cachegrind output file, but it is now
 635 deprecated because cg_annotate's merging does a better job.)
 636 </para>
 637
 638 <para>
 639 Use it as follows:
 640 </para>
 641
 642 <programlisting><![CDATA[
 643 cg_annotate file1 file2 file3 ...
 644 ]]></programlisting>
 645
 646 <para>
 647 cg_annotate computes the sum of these files (effectively
 648 <filename>file1</filename> + <filename>file2</filename> +
 649 <filename>file3</filename>), and then produces output as usual that shows the
 650 summed counts.
 651 </para>
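<para>
Conceptually, merging is a per-entry sum. The Python sketch below adds two sets
of per-function counts with <computeroutput>collections.Counter</computeroutput>.
The counts for the second run are invented for illustration, and the real
Cachegrind output file format is richer than a flat dictionary.
</para>

<programlisting><![CDATA[
```python
# Sketch: merging two runs by summing the counts of matching entries.
from collections import Counter

run1 = Counter({"concord.c:get_word": 1_630_232, "concord.c:hash": 630_918})
# Hypothetical second run of the same program on a different input.
run2 = Counter({"concord.c:get_word": 1_500_000, "concord.c:insert": 461_095})

merged = run1 + run2
for key, count in merged.most_common():
    print(f"{count:>12,}  {key}")
```
]]></programlisting>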
 652
 653 <para>
 654 The most common merging scenario is if you want to aggregate costs over
 655 multiple runs of the same program, possibly on different inputs.
 656 </para>
 657
 658 </sect2>
 659
 660
 661 <sect2 id="cg-manual.cg_diff" xreflabel="cg_diff">
 662 <title>Differencing Cachegrind Output Files</title>
 663
 664 <para>
 665 cg_annotate can diff data from two Cachegrind output files in a single run.
 666 (There is also a program called cg_diff that can diff two Cachegrind output
 667 files into a single Cachegrind output file, but it is now deprecated because
 668 cg_annotate's differencing does a better job.)
 669 </para>
 670
 671 <para>
 672 Use it as follows:
 673 </para>
 674
 675 <programlisting><![CDATA[
 676 cg_annotate --diff file1 file2
 677 ]]></programlisting>
 678
 679 <para>
 680 cg_annotate computes the difference between these two files (effectively
 681 <filename>file2</filename> - <filename>file1</filename>), and then
 682 produces output as usual that shows the count differences. Note that many of
 683 the counts may be negative; this indicates that the counts for the relevant
 684 file/function/line are smaller in the second version than those in the first
 685 version.
 686 </para>
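<para>
The sign convention can be seen in a small Python sketch that subtracts the
first set of counts from the second. The function names, counts, and the helper
<function>diff_counts</function> are hypothetical; the point is only that an
entry that shrinks or disappears in the second file yields a negative value.
</para>

<programlisting><![CDATA[
```python
# Sketch: differencing two runs as (second - first) for every entry.
def diff_counts(first, second):
    """Return second - first per key, treating missing keys as zero."""
    keys = set(first) | set(second)
    return {k: second.get(k, 0) - first.get(k, 0) for k in keys}


file1 = {"prog.c:f": 1_000, "prog.c:g": 500}  # hypothetical first run
file2 = {"prog.c:f": 800, "prog.c:h": 50}     # hypothetical second run

delta = diff_counts(file1, file2)
for key in sorted(delta):
    print(f"{delta[key]:>+8,}  {key}")
```
]]></programlisting>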
 687
 688 <para>
 689 The simplest common scenario is comparing two Cachegrind output files that came
 690 from the same program, but on different inputs. cg_annotate will do a good job
 691 on this without assistance.
 692 </para>
 693
 694 <para>
 695 A more complex scenario is if you want to compare Cachegrind output files from
 696 two slightly different versions of a program that you have sitting
 697 side-by-side, running on the same input. For example, you might have
 698 <filename>version1/prog.c</filename> and <filename>version2/prog.c</filename>.
 699 A straight comparison of the two would not be useful. Because functions are
 700 always paired with filenames, a function <function>f</function> would be listed
 701 as <filename>version1/prog.c:f</filename> for the first version but
 702 <filename>version2/prog.c:f</filename> for the second version.
 703 </para>
 704
 705 <para>
 706 In this case, use the <option>--mod-filename</option> option. Its argument is a
 707 search-and-replace expression that will be applied to all the filenames in both
 708 Cachegrind output files.  It can be used to remove minor differences in
 709 filenames. For example, the option
 710 <option>--mod-filename='s/version[0-9]/versionN/'</option> will suffice for the
 711 above example.
 712 </para>
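<para>
To see what such an expression does to the filenames, the following Python
sketch applies an equivalent substitution with
<computeroutput>re.sub</computeroutput>. It illustrates the effect only; the
helper name <function>mod_filename</function> is an assumption and this is not
cg_annotate's own code.
</para>

<programlisting><![CDATA[
```python
# Sketch: the effect of s/version[0-9]/versionN/ on filenames.
import re


def mod_filename(name):
    """Replace the first 'version<digit>' in name with 'versionN'."""
    return re.sub(r"version[0-9]", "versionN", name, count=1)


old = mod_filename("version1/prog.c")
new = mod_filename("version2/prog.c")
print(old, new)  # both names now agree, so their counts can be paired
```
]]></programlisting>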
 713
 714 <para>
 715 Similarly, sometimes compilers auto-generate certain functions and give them
 716 randomized names like <function>T.1234</function> where the suffixes vary from
 717 build to build. You can use the <option>--mod-funcname</option> option to
 718 remove small differences like these; it works in the same way as
 719 <option>--mod-filename</option>.
 720 </para>
 721
 722 <para>
 723 When <option>--mod-filename</option> is used to compare two different versions
 724 of the same program, cg_annotate will not annotate any file that is different
 725 between the two versions, because the per-line counts are not reliable in such
 726 a case. For example, imagine if <filename>version2/prog.c</filename> is the
 727 same as <filename>version1/prog.c</filename> except with an extra blank line at
 728 the top of the file. Every single per-line count will have changed. In
 729 comparison, the per-file and per-function counts have not changed, and are
 730 still very useful for determining differences between programs. You might think
 731 that this means every interesting file will be left unannotated, but again
 732 inlining means that files that are identical in the two versions can have
 733 different counts on many lines.
 734 </para>
 735
 736
 737 </sect2>
 738
 739 <sect2 id="cg-manual.cache-branch-sim" xreflabel="cache-branch-sim">
 740 <title>Cache and Branch Simulation</title>
 741
 742 <para>
 743 Cachegrind can simulate how your program interacts with a machine's cache
 744 hierarchy and/or branch predictor.
 745
 746 The cache simulation models a machine with independent first-level instruction
 747 and data caches (I1 and D1), backed by a unified second-level cache (L2). Some
 748 machines have three or four levels of cache; for these (where Cachegrind can
 749 auto-detect the cache configuration) it simulates the first-level and last-level
 750 caches. Therefore, Cachegrind always refers to the I1, D1 and LL (last-level) caches.
 751 </para>
 752
 753 <para>
 754 When simulating the cache, with <option>--cache-sim=yes</option>, Cachegrind
 755 gathers the following statistics:
 756 </para>
 757
 758 <itemizedlist>
 759   <listitem>
 760     <para>
 761     I cache reads (<computeroutput>Ir</computeroutput>, which equals the number
 762     of instructions executed), I1 cache read misses
 763     (<computeroutput>I1mr</computeroutput>) and LL cache instruction read
 764     misses (<computeroutput>ILmr</computeroutput>).
 765     </para>
 766   </listitem>
 767   <listitem>
 768     <para>
 769     D cache reads (<computeroutput>Dr</computeroutput>, which equals the number
 770     of memory reads), D1 cache read misses
 771     (<computeroutput>D1mr</computeroutput>), and LL cache data read misses
 772     (<computeroutput>DLmr</computeroutput>).
 773     </para>
 774   </listitem>
 775   <listitem>
 776     <para>
 777     D cache writes (<computeroutput>Dw</computeroutput>, which equals the
 778     number of memory writes), D1 cache write misses
 779     (<computeroutput>D1mw</computeroutput>), and LL cache data write misses
 780     (<computeroutput>DLmw</computeroutput>).
 781     </para>
 782   </listitem>
 783 </itemizedlist>
 784
 785 <para>
 786 Note that D1 total misses is given by <computeroutput>D1mr</computeroutput> +
 787 <computeroutput>D1mw</computeroutput>, and that LL total misses is given by
 788 <computeroutput>ILmr</computeroutput> + <computeroutput>DLmr</computeroutput> +
 789 <computeroutput>DLmw</computeroutput>.
 790 </para>
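
<para>
As an illustration, miss rates can be derived from these counts. The following
Python sketch is not part of Cachegrind: the function name
<computeroutput>miss_rates</computeroutput> and the sample numbers are made up,
and expressing the LL miss rate relative to all memory references is only one
possible convention.
</para>

<programlisting><![CDATA[
def miss_rates(counts):
    """Return I1, D1 and LL miss rates from a dict of event counts."""
    i_refs = counts["Ir"]
    d_refs = counts["Dr"] + counts["Dw"]
    i1_misses = counts["I1mr"]
    d1_misses = counts["D1mr"] + counts["D1mw"]
    ll_misses = counts["ILmr"] + counts["DLmr"] + counts["DLmw"]
    return {
        "I1": i1_misses / i_refs,
        "D1": d1_misses / d_refs,
        "LL": ll_misses / (i_refs + d_refs),   # assumed convention
    }


# Made-up counts, for illustration only.
example = {
    "Ir": 1000000, "I1mr": 500, "ILmr": 100,
    "Dr": 300000, "D1mr": 6000, "DLmr": 1500,
    "Dw": 100000, "D1mw": 2000, "DLmw": 500,
}
rates = miss_rates(example)
for cache, rate in rates.items():
    print(f"{cache} miss rate: {rate:.2%}")
]]></programlisting>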
 791
 792 <para>
 793 When simulating the branch predictor, with <option>--branch-sim=yes</option>,
 794 Cachegrind gathers the following statistics:
 795 </para>
 796
 797 <itemizedlist>
 798   <listitem>
 799     <para>
 800     Conditional branches executed (<computeroutput>Bc</computeroutput>) and
 801     conditional branches mispredicted (<computeroutput>Bcm</computeroutput>).
 802     </para>
 803   </listitem>
 804   <listitem>
 805     <para>
 806     Indirect branches executed (<computeroutput>Bi</computeroutput>) and
 807     indirect branches mispredicted (<computeroutput>Bim</computeroutput>).
 808     </para>
 809   </listitem>
 810 </itemizedlist>
 811
 812 <para>
 813 When cache and/or branch simulation is enabled, cg_annotate will print multiple
 814 counts per line of output. For example:
 815 </para>
 816
 817 <programlisting><![CDATA[
 818   Ir______________________ Bc____________________ Bcm__________________ Bi____________________ Bim______________  function:file
 819
 820 >     8,547  (0.1%, 99.4%)     936  (0.1%, 99.1%)    177  (0.3%, 96.7%)      59  (0.0%, 99.9%) 38 (19.4%, 66.3%)  strcmp:
 821       8,503  (0.1%)            928  (0.1%)           175  (0.3%)             59  (0.0%)        38 (19.4%)           ./string/../sysdeps/x86_64/multiarch/../multiarch/strcmp-sse2.S
 822 ]]></programlisting>
 823
 824 </sect2>
 825
 826 </sect1>
 827
 828
 829 <sect1 id="cg-manual.cgopts" xreflabel="Cachegrind Command-line Options">
 830 <title>Cachegrind Command-line Options</title>
 831
 832 <!-- start of xi:include in the manpage -->
 833 <para>
 834 Cachegrind-specific options are:
 835 </para>
 836
 837 <variablelist id="cg.opts.list">
 838
 839   <varlistentry id="opt.cachegrind-out-file" xreflabel="--cachegrind-out-file">
 840     <term>
 841       <option><![CDATA[--cachegrind-out-file=<file> ]]></option>
 842     </term>
 843     <listitem>
 844       <para>
 845       Write the Cachegrind output file to <filename>file</filename> rather than
 846       to the default output file,
 847       <filename>cachegrind.out.&lt;pid&gt;</filename>. The <option>%p</option>
 848       and <option>%q</option> format specifiers can be used to embed the
 849       process ID and/or the contents of an environment variable in the name, as
 850       is the case for the core option
 851       <option><link linkend="opt.log-file">--log-file</link></option>.
 852       </para>
 853     </listitem>
 854   </varlistentry>
 855
 856   <varlistentry id="opt.cache-sim" xreflabel="--cache-sim">
 857     <term>
 858       <option><![CDATA[--cache-sim=no|yes [no] ]]></option>
 859     </term>
 860     <listitem>
 861       <para>
 862       Enables or disables collection of cache access and miss counts.
 863       </para>
 864     </listitem>
 865   </varlistentry>
 866
 867   <varlistentry id="opt.branch-sim" xreflabel="--branch-sim">
 868     <term>
 869       <option><![CDATA[--branch-sim=no|yes [no] ]]></option>
 870     </term>
 871     <listitem>
 872       <para>
 873       Enables or disables collection of branch instruction and
 874       misprediction counts.
 875       </para>
 876     </listitem>
 877   </varlistentry>
 878
 879   <varlistentry id="opt.instr-at-start" xreflabel="--instr-at-start">
 880     <term>
 881       <option><![CDATA[--instr-at-start=no|yes [yes] ]]></option>
 882     </term>
 883     <listitem>
 884       <para>
 885       Enables or disables instrumentation at the start of execution.
 886       Use this in combination with
 887       <computeroutput>CACHEGRIND_START_INSTRUMENTATION</computeroutput> and
 888       <computeroutput>CACHEGRIND_STOP_INSTRUMENTATION</computeroutput> to
 889       measure only part of a client program's execution.
 890       </para>
 891     </listitem>
 892   </varlistentry>
 893
 894   <varlistentry id="cg.opt.I1" xreflabel="--I1">
 895     <term>
 896       <option><![CDATA[--I1=<size>,<associativity>,<line size> ]]></option>
 897     </term>
 898     <listitem>
 899       <para>
 900       Specify the size, associativity and line size of the level 1 instruction
 901       cache. Only useful with <option>--cache-sim=yes</option>.
 902       </para>
 903     </listitem>
 904   </varlistentry>
 905
 906   <varlistentry id="cg.opt.D1" xreflabel="--D1">
 907     <term>
 908       <option><![CDATA[--D1=<size>,<associativity>,<line size> ]]></option>
 909     </term>
 910     <listitem>
 911       <para>
 912       Specify the size, associativity and line size of the level 1 data cache.
 913       Only useful with <option>--cache-sim=yes</option>.
 914       </para>
 915     </listitem>
 916   </varlistentry>
 917
 918   <varlistentry id="cg.opt.LL" xreflabel="--LL">
 919     <term>
 920       <option><![CDATA[--LL=<size>,<associativity>,<line size> ]]></option>
 921     </term>
 922     <listitem>
 923       <para>
 924       Specify the size, associativity and line size of the last-level cache.
 925       Only useful with <option>--cache-sim=yes</option>.
 926       </para>
 927     </listitem>
 928   </varlistentry>
 929
 930 </variablelist>
 931 <!-- end of xi:include in the manpage -->
 932
 933 </sect1>
 934
 935
 936
 937 <sect1 id="cg-manual.annopts" xreflabel="cg_annotate Command-line Options">
 938 <title>cg_annotate Command-line Options</title>
 939
 940 <!-- start of xi:include in the manpage -->
 941 <variablelist id="cg_annotate.opts.list">
 942
 943   <varlistentry>
 944     <term>
 945       <option><![CDATA[-h --help ]]></option>
 946     </term>
 947     <listitem>
 948       <para>Show the help message.</para>
 949     </listitem>
 950   </varlistentry>
 951
 952   <varlistentry>
 953     <term>
 954       <option><![CDATA[--version ]]></option>
 955     </term>
 956     <listitem>
 957       <para>Show the version number.</para>
 958     </listitem>
 959   </varlistentry>
 960
 961   <varlistentry>
 962     <term>
 963       <option><![CDATA[--diff ]]></option>
 964     </term>
 965     <listitem>
 966       <para>Diff two Cachegrind output files.</para>
 967     </listitem>
 968   </varlistentry>
 969
 970   <varlistentry>
 971     <term>
 972       <option><![CDATA[--mod-filename <regex> [default: none]]]></option>
 973     </term>
 974     <listitem>
 975       <para>
 976       Specifies an <option>s/old/new/</option> search-and-replace expression
 977       that is applied to all filenames. Useful when differencing, for removing
 978       minor differences in paths between two different versions of a program
 979       that are sitting in different directories. An <option>i</option> suffix
 980       makes the regex case-insensitive, and a <option>g</option> suffix makes
 981       it match multiple times.
 982       </para>
 983     </listitem>
 984   </varlistentry>
 985
 986   <varlistentry>
 987     <term>
 988       <option><![CDATA[--mod-funcname <regex> [default: none]]]></option>
 989     </term>
 990     <listitem>
 991       <para>
      Like <option>--mod-filename</option>, but for function names. Useful for
      removing minor differences in the randomized names of functions
      auto-generated by some compilers.
 995       </para>
 996     </listitem>
 997   </varlistentry>
 998
 999   <varlistentry>
1000     <term>
1001       <option><![CDATA[--show=A,B,C [default: all, using order in
1002       the Cachegrind output file] ]]></option>
1003     </term>
1004     <listitem>
1005       <para>
1006       Specifies which events to show (and the column order). Default is to use
1007       all present in the Cachegrind output file (and use the order in the
1008       file). Best used in conjunction with <option>--sort</option>.
1009       </para>
1010     </listitem>
1011   </varlistentry>
1012
1013   <varlistentry>
1014     <term>
1015       <option><![CDATA[--sort=A,B,C [default: order in the Cachegrind output file] ]]></option>
1016     </term>
1017     <listitem>
1018       <para>
1019       Specifies the events upon which the sorting of the file:function and
1020       function:file entries will be based.
1021       </para>
1022     </listitem>
1023   </varlistentry>
1024
1025   <varlistentry>
1026     <term>
1027       <option><![CDATA[--threshold=X [default: 0.1%] ]]></option>
1028     </term>
1029     <listitem>
1030       <para>
      Sets the significance threshold for the file:function and function:file
1032       sections. A file or function is shown if it accounts for more than X% of
1033       the counts for the primary sort event.  If annotating source files, this
1034       also affects which files are annotated.
1035       </para>
1036     </listitem>
1037   </varlistentry>
1038
1039   <varlistentry>
1040     <term>
1041       <option><![CDATA[--show-percs, --no-show-percs, --show-percs=<no|yes> [default: yes] ]]></option>
1042     </term>
1043     <listitem>
1044       <para>
1045       When enabled, a percentage is printed next to all event counts. This
1046       helps gauge the relative importance of each function and line.
1047       </para>
1048     </listitem>
1049   </varlistentry>
1050
1051   <varlistentry>
1052     <term>
      <option><![CDATA[--annotate, --no-annotate, --annotate=<no|yes> [default: yes] ]]></option>
1054     </term>
1055     <listitem>
1056       <para>
1057       Enables or disables source file annotation.
1058       </para>
1059     </listitem>
1060   </varlistentry>
1061
1062   <varlistentry>
1063     <term>
1064       <option><![CDATA[--context=N [default: 8] ]]></option>
1065     </term>
1066     <listitem>
1067       <para>
1068       The number of lines of context to show before and after each annotated
1069       line. Use a large number (e.g. 100000) to show all source lines.
1070       </para>
1071     </listitem>
1072   </varlistentry>
1073
1074 </variablelist>
1075 <!-- end of xi:include in the manpage -->
1076
1077 </sect1>
1078
1079
1080 <sect1 id="cg-manual.mergeopts" xreflabel="cg_merge Command-line Options">
1081 <title>cg_merge Command-line Options</title>
1082
1083 <!-- start of xi:include in the manpage -->
1084 <variablelist id="cg_merge.opts.list">
1085
1086   <varlistentry>
1087     <term>
1088       <option><![CDATA[-o outfile]]></option>
1089     </term>
1090     <listitem>
1091       <para>
      Write the output to <computeroutput>outfile</computeroutput>
1093       instead of standard output.
1094       </para>
1095     </listitem>
1096   </varlistentry>
1097
1098 </variablelist>
1099 <!-- end of xi:include in the manpage -->
1100
1101 </sect1>
1102
1103
1104 <sect1 id="cg-manual.diffopts" xreflabel="cg_diff Command-line Options">
1105 <title>cg_diff Command-line Options</title>
1106
1107 <!-- start of xi:include in the manpage -->
1108 <variablelist id="cg_diff.opts.list">
1109
1110   <varlistentry>
1111     <term>
1112       <option><![CDATA[-h --help ]]></option>
1113     </term>
1114     <listitem>
1115       <para>Show the help message.</para>
1116     </listitem>
1117   </varlistentry>
1118
1119   <varlistentry>
1120     <term>
1121       <option><![CDATA[--version ]]></option>
1122     </term>
1123     <listitem>
1124       <para>Show the version number.</para>
1125     </listitem>
1126   </varlistentry>
1127
1128   <varlistentry>
1129     <term>
1130       <option><![CDATA[--mod-filename=<expr> [default: none]]]></option>
1131     </term>
1132     <listitem>
1133       <para>
1134       Specifies an <option>s/old/new/</option> search-and-replace expression
1135       that is applied to all filenames.
1136       </para>
1137     </listitem>
1138   </varlistentry>
1139
1140   <varlistentry>
1141     <term>
1142       <option><![CDATA[--mod-funcname=<expr> [default: none]]]></option>
1143     </term>
1144     <listitem>
1145       <para>
      Like <option>--mod-filename</option>, but for function names.
1147       </para>
1148     </listitem>
1149   </varlistentry>
1150
1151 </variablelist>
1152 <!-- end of xi:include in the manpage -->
1153
1154 </sect1>
1155
1156
1157 <sect1 id="cg-manual.clientrequests" xreflabel="Client requests">
1158 <title>Cachegrind Client Requests</title>
1159
1160 <para>Cachegrind provides the following client requests in
1161 <filename>cachegrind.h</filename>.
1162 </para>
1163
1164 <variablelist id="cg.clientrequests.list">
1165
1166   <varlistentry id="cg.cr.start-instr" xreflabel="CACHEGRIND_START_INSTRUMENTATION">
1167     <term>
1168       <computeroutput>CACHEGRIND_START_INSTRUMENTATION</computeroutput>
1169     </term>
1170     <listitem>
1171       <para>Start Cachegrind instrumentation if not already enabled. Use this
1172       in combination with
1173       <computeroutput>CACHEGRIND_STOP_INSTRUMENTATION</computeroutput> and
1174       <option><link linkend="opt.instr-at-start">--instr-at-start</link></option>
1175       to measure only part of a client program's execution.
1176       </para>
1177     </listitem>
1178   </varlistentry>
1179
1180   <varlistentry id="cg.cr.stop-instr" xreflabel="CACHEGRIND_STOP_INSTRUMENTATION">
1181     <term>
1182       <computeroutput>CACHEGRIND_STOP_INSTRUMENTATION</computeroutput>
1183     </term>
1184     <listitem>
1185       <para>Stop Cachegrind instrumentation if not already disabled. Use this
1186       in combination with
1187       <computeroutput>CACHEGRIND_START_INSTRUMENTATION</computeroutput> and
1188       <option><link linkend="opt.instr-at-start">--instr-at-start</link></option>
1189       to measure only part of a client program's execution.
1190       </para>
1191     </listitem>
1192   </varlistentry>
1193
1194 </variablelist>
1195
1196 </sect1>
1197
1198
1199 <sect1 id="cg-manual.sim-details"
1200        xreflabel="Simulation Details">
1201 <title>Simulation Details</title>
1202 <para>
This section covers details that you don't need to know in order to
use Cachegrind, but that may be of interest to some people.
1205 </para>
1206
1207 <sect2 id="cache-sim" xreflabel="Cache Simulation Specifics">
1208 <title>Cache Simulation Specifics</title>
1209
1210 <para>
1211 The cache simulation approximates the hardware of an AMD Athlon CPU circa 2002.
1212 Its specific characteristics are as follows:</para>
1213
1214 <itemizedlist>
1215
1216   <listitem>
1217     <para>Write-allocate: when a write miss occurs, the block
1218     written to is brought into the D1 cache.  Most modern caches
1219     have this property.</para>
1220   </listitem>
1221
1222   <listitem>
1223     <para>Bit-selection hash function: the set of line(s) in the cache
1224     to which a memory block maps is chosen by the middle bits
1225     M--(M+N-1) of the byte address, where:</para>
1226     <itemizedlist>
1227       <listitem>
1228         <para>line size = 2^M bytes</para>
1229       </listitem>
1230       <listitem>
        <para>number of sets (cache size / line size / associativity) = 2^N</para>
1232       </listitem>
1233     </itemizedlist>
1234   </listitem>
1235
1236   <listitem>
1237     <para>Inclusive LL cache: the LL cache typically replicates all
1238     the entries of the L1 caches, because fetching into L1 involves
1239     fetching into LL first (this does not guarantee strict inclusiveness,
1240     as lines evicted from LL still could reside in L1).  This is
1241     standard on Pentium chips, but AMD Opterons, Athlons and Durons
1242     use an exclusive LL cache that only holds
1243     blocks evicted from L1.  Ditto most modern VIA CPUs.</para>
1244   </listitem>
1245
1246 </itemizedlist>
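
<para>
The bit-selection scheme described above can be sketched in Python as follows.
This is an illustration only, not Cachegrind's own code: the function name
<computeroutput>set_index</computeroutput> is hypothetical, and the example
assumes a 32 KiB, 8-way cache with 64-byte lines.
</para>

<programlisting><![CDATA[
def set_index(addr, cache_size, assoc, line_size):
    """Return the cache set that a byte address maps to (bit selection)."""
    m = line_size.bit_length() - 1                # line size = 2^M bytes
    num_sets = cache_size // line_size // assoc   # number of sets = 2^N
    return (addr >> m) & (num_sets - 1)           # bits M..(M+N-1)


# Assumed configuration: 32768 bytes, 8-way, 64-byte lines (64 sets).
idx = set_index(0x12345, 32768, 8, 64)
same_line = set_index(0x12340, 32768, 8, 64)        # same 64-byte line
aliased = set_index(0x12345 + 4096, 32768, 8, 64)   # 64 sets * 64 bytes away
print(idx, same_line, aliased)
]]></programlisting>

<para>
Addresses within one line share a set, and so do addresses that differ by a
multiple of (number of sets * line size) bytes.
</para>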
1247
1248 <para>The cache configuration simulated (cache size,
1249 associativity and line size) is determined automatically using
1250 the x86 CPUID instruction.  If you have a machine that (a)
1251 doesn't support the CPUID instruction, or (b) supports it in an
1252 early incarnation that doesn't give any cache information, then
1253 Cachegrind will fall back to using a default configuration (that
1254 of a model 3/4 Athlon).  Cachegrind will tell you if this
1255 happens.  You can manually specify one, two or all three levels
1256 (I1/D1/LL) of the cache from the command line using the
1257 <option>--I1</option>,
1258 <option>--D1</option> and
1259 <option>--LL</option> options.
1260 For cache parameters to be valid for simulation, the number
1261 of sets (with associativity being the number of cache lines in
1262 each set) has to be a power of two.</para>
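
<para>
A minimal Python sketch of the power-of-two rule follows. The helper names are
hypothetical, and only the one constraint stated above is checked; Cachegrind
itself may apply further checks that are not modelled here.
</para>

<programlisting><![CDATA[
def parse_cache_option(value):
    """Parse a '<size>,<associativity>,<line size>' string into three ints."""
    size, assoc, line_size = (int(field) for field in value.split(","))
    return size, assoc, line_size


def has_power_of_two_sets(size, assoc, line_size):
    """Check that size / (associativity * line size) is a power of two."""
    if size <= 0 or assoc <= 0 or line_size <= 0:
        return False
    if size % (assoc * line_size) != 0:
        return False
    num_sets = size // (assoc * line_size)
    return (num_sets & (num_sets - 1)) == 0


for option in ("32768,8,64", "49152,12,64", "49152,8,64"):
    ok = has_power_of_two_sets(*parse_cache_option(option))
    print(option, "valid" if ok else "invalid")
]]></programlisting>

<para>
Note that the cache size itself need not be a power of two: 49152 bytes with
12-way associativity and 64-byte lines gives 64 sets, whereas the same size
with 8-way associativity gives 96 sets.
</para>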
1263
1264 <para>On PowerPC platforms
1265 Cachegrind cannot automatically
1266 determine the cache configuration, so you will
1267 need to specify it with the
1268 <option>--I1</option>,
1269 <option>--D1</option> and
1270 <option>--LL</option> options.</para>
1271
1272
1273 <para>Other noteworthy behaviour:</para>
1274
1275 <itemizedlist>
1276   <listitem>
1277     <para>References that straddle two cache lines are treated as
1278     follows:</para>
1279     <itemizedlist>
1280       <listitem>
        <para>If both blocks hit --&gt; counted as one hit.</para>
      </listitem>
      <listitem>
        <para>If one block hits and the other misses --&gt; counted
        as one miss.</para>
      </listitem>
      <listitem>
        <para>If both blocks miss --&gt; counted as one miss (not
        two).</para>
1290       </listitem>
1291     </itemizedlist>
1292   </listitem>
1293
1294   <listitem>
1295     <para>Instructions that modify a memory location
1296     (e.g. <computeroutput>inc</computeroutput> and
1297     <computeroutput>dec</computeroutput>) are counted as doing
1298     just a read, i.e. a single data reference.  This may seem
1299     strange, but since the write can never cause a miss (the read
1300     guarantees the block is in the cache) it's not very
1301     interesting.</para>
1302
1303     <para>Thus it measures not the number of times the data cache
1304     is accessed, but the number of times a data cache miss could
1305     occur.</para>
1306   </listitem>
1307
1308 </itemizedlist>
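
<para>
The treatment of straddling references can be sketched with a toy cache model
in Python. This is not Cachegrind's implementation: the class and method names
are hypothetical, the cache geometry is made up, and LRU replacement is an
assumption of the sketch.
</para>

<programlisting><![CDATA[
from collections import OrderedDict


class Cache:
    """Tiny set-associative cache model (LRU replacement is assumed here)."""

    def __init__(self, size, assoc, line_size):
        self.assoc = assoc
        self.line_bits = line_size.bit_length() - 1
        self.num_sets = size // line_size // assoc
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.misses = 0

    def _touch(self, block):
        """Look up one block; on a miss, allocate it. Return True on a hit."""
        lines = self.sets[block & (self.num_sets - 1)]
        if block in lines:
            lines.move_to_end(block)     # mark as most recently used
            return True
        if len(lines) >= self.assoc:
            lines.popitem(last=False)    # evict the least recently used
        lines[block] = True
        return False

    def access(self, addr, size):
        """Simulate one reference; return True for a hit, False for a miss."""
        first = addr >> self.line_bits
        last = (addr + size - 1) >> self.line_bits
        # Touch every block the reference covers, without short-circuiting.
        hits = [self._touch(block) for block in range(first, last + 1)]
        if all(hits):
            return True
        self.misses += 1                 # one miss, even if both blocks missed
        return False


cache = Cache(1024, 2, 64)               # hypothetical 1 KiB, 2-way cache
results = [
    cache.access(60, 8),     # straddles blocks 0 and 1, both miss
    cache.access(60, 8),     # both blocks now hit
    cache.access(124, 8),    # block 1 hits, block 2 misses
    cache.access(128, 4),    # block 2 alone, now a hit
]
print(results, cache.misses)
]]></programlisting>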
1309
1310 <para>
1311 If you are interested in simulating a cache with different properties, it is
1312 not particularly hard to write your own cache simulator, or to modify the
1313 existing ones in <computeroutput>cg_sim.c</computeroutput>.
1314 </para>
1315
1316 </sect2>
1317
1318
1319 <sect2 id="branch-sim" xreflabel="Branch Simulation Specifics">
1320 <title>Branch Simulation Specifics</title>
1321
1322 <para>Cachegrind simulates branch predictors intended to be
1323 typical of mainstream desktop/server processors of around 2004.</para>
1324
1325 <para>Conditional branches are predicted using an array of 16384 2-bit
1326 saturating counters.  The array index used for a branch instruction is
1327 computed partly from the low-order bits of the branch instruction's
1328 address and partly using the taken/not-taken behaviour of the last few
1329 conditional branches.  As a result the predictions for any specific
1330 branch depend both on its own history and the behaviour of previous
1331 branches.  This is a standard technique for improving prediction
1332 accuracy.</para>
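
<para>
A predictor of this general kind can be sketched in Python as follows. Only
the table size and the use of 2-bit saturating counters come from the
description above. The XOR index hash, the 4-bit history length, the initial
counter value and all names are assumptions made for illustration, and may
differ from what Cachegrind actually does.
</para>

<programlisting><![CDATA[
class ConditionalPredictor:
    TABLE_SIZE = 16384                        # number of 2-bit counters

    def __init__(self, history_bits=4):       # history length is an assumption
        self.counters = [1] * self.TABLE_SIZE  # assumed start: weakly not-taken
        self.history = 0
        self.history_mask = (1 << history_bits) - 1
        self.mispredicts = 0

    def predict_and_update(self, addr, taken):
        """Predict one conditional branch, then train on the real outcome."""
        # Assumed hash: XOR address bits with recent taken/not-taken history.
        idx = (addr ^ self.history) & (self.TABLE_SIZE - 1)
        predicted_taken = self.counters[idx] >= 2
        if taken:
            self.counters[idx] = min(3, self.counters[idx] + 1)
        else:
            self.counters[idx] = max(0, self.counters[idx] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask
        correct = predicted_taken == bool(taken)
        if not correct:
            self.mispredicts += 1
        return correct


predictor = ConditionalPredictor()
for _ in range(100):                          # a branch that is always taken
    predictor.predict_and_update(0x400100, True)
print(predictor.mispredicts)
]]></programlisting>

<para>
In this sketch the always-taken branch is mispredicted only while the history
register fills up; once the history is stable, the same counter is used on
every execution and saturates towards "taken".
</para>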
1333
1334 <para>For indirect branches (that is, jumps to unknown destinations)
1335 Cachegrind uses a simple branch target address predictor.  Targets are
1336 predicted using an array of 512 entries indexed by the low order 9
1337 bits of the branch instruction's address.  Each branch is predicted to
1338 jump to the same address it did last time.  Any other behaviour causes
1339 a mispredict.</para>
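
<para>
The indirect branch target predictor described above is simple enough to
sketch in a few lines of Python. The class name and the empty initial table
are assumptions of this sketch, not details taken from Cachegrind.
</para>

<programlisting><![CDATA[
class IndirectPredictor:
    ENTRIES = 512

    def __init__(self):
        self.targets = [None] * self.ENTRIES  # empty start is an assumption
        self.mispredicts = 0

    def predict_and_update(self, addr, target):
        """Predict the last target seen in this entry, then record the new one."""
        idx = addr & (self.ENTRIES - 1)       # low-order 9 bits of the address
        correct = self.targets[idx] == target
        self.targets[idx] = target
        if not correct:
            self.mispredicts += 1
        return correct


predictor = IndirectPredictor()
outcomes = [
    predictor.predict_and_update(0x1000, 0x2000),        # first use: mispredict
    predictor.predict_and_update(0x1000, 0x2000),        # same target: correct
    predictor.predict_and_update(0x1000, 0x3000),        # new target: mispredict
    predictor.predict_and_update(0x1000 + 512, 0x3000),  # aliases the same entry
]
print(outcomes, predictor.mispredicts)
]]></programlisting>

<para>
The last call shows that two branches whose addresses share the same low 9
bits also share a table entry, so each one affects the prediction made for the
other.
</para>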
1340
1341 <para>More recent processors have better branch predictors, in
1342 particular better indirect branch predictors.  Cachegrind's predictor
1343 design is deliberately conservative so as to be representative of the
1344 large installed base of processors which pre-date widespread
1345 deployment of more sophisticated indirect branch predictors.  In
1346 particular, late model Pentium 4s (Prescott), Pentium M, Core and Core
1347 2 have more sophisticated indirect branch predictors than modelled by
1348 Cachegrind.  </para>
1349
1350 <para>Cachegrind does not simulate a return stack predictor.  It
1351 assumes that processors perfectly predict function return addresses,
1352 an assumption which is probably close to being true.</para>
1353
1354 <para>See Hennessy and Patterson's classic text "Computer
1355 Architecture: A Quantitative Approach", 4th edition (2007), Section
1356 2.3 (pages 80-89) for background on modern branch predictors.</para>
1357
1358 </sect2>
1359
1360 <sect2 id="cg-manual.annopts.accuracy" xreflabel="Accuracy">
1361 <title>Accuracy</title>
1362
1363 <para>
1364 Cachegrind's instruction counting has one shortcoming on x86/amd64:
1365 </para>
1366
1367 <itemizedlist>
1368   <listitem>
1369     <para>
    When a <function>REP</function>-prefixed instruction executes, each
1371     iteration is counted separately. In contrast, hardware counters count each
1372     such instruction just once, no matter how many times it iterates. It is
1373     arguable that Cachegrind's behaviour is more useful.
1374     </para>
1375   </listitem>
1376 </itemizedlist>
1377
1378 <para>
Cachegrind's cache and branch profiling has a number of shortcomings:
1380 </para>
1381
1382 <itemizedlist>
1383   <listitem>
1384     <para>
1385     It doesn't account for kernel activity. The effect of system calls on the
1386     cache and branch predictor contents is ignored.
1387     </para>
1388   </listitem>
1389
1390   <listitem>
1391     <para>
1392     It doesn't account for other process activity. This is arguably desirable
1393     when considering a single program.
1394     </para>
1395   </listitem>
1396
1397   <listitem>
1398     <para>It doesn't account for virtual-to-physical address
1399     mappings.  Hence the simulation is not a true
1400     representation of what's happening in the
1401     cache.  Most caches and branch predictors are physically indexed, but
1402     Cachegrind simulates caches using virtual addresses.</para>
1403   </listitem>
1404
1405   <listitem>
1406     <para>It doesn't account for cache misses not visible at the
1407     instruction level, e.g. those arising from TLB misses, or
1408     speculative execution.</para>
1409   </listitem>
1410
1411   <listitem>
1412     <para>Valgrind will schedule
1413     threads differently from how they would be when running natively.
1414     This could warp the results for threaded programs.</para>
1415   </listitem>
1416
1417   <listitem>
1418     <para>
1419     The x86/amd64 instructions <computeroutput>bts</computeroutput>,
1420     <computeroutput>btr</computeroutput> and
1421     <computeroutput>btc</computeroutput> will incorrectly be counted as doing a
1422     data read if both the arguments are registers, e.g.:
1423     <programlisting><![CDATA[
1424     btsl %eax, %edx]]></programlisting>
1425     This should only happen rarely.
1426     </para>
1427   </listitem>
1428
1429   <listitem>
1430     <para>x86/amd64 FPU instructions with data sizes of 28 and 108 bytes
1431     (e.g.  <computeroutput>fsave</computeroutput>) are treated as
1432     though they only access 16 bytes.  These instructions seem to
1433     be rare so hopefully this won't affect accuracy much.</para>
1434   </listitem>
1435
1436 </itemizedlist>
1437
<para>Another thing worth noting is that simulation results are very
sensitive to the memory layout of the profiled program.
Changing the size of the executable being profiled, or the sizes
of any of the shared libraries it uses, or even the length of their
file names, can perturb the results.  Variations will be small, but
don't expect perfectly repeatable results if your program changes at
all.</para>
1444
1445 <para>
1446 Many Linux distributions perform address space layout randomisation (ASLR), in
1447 which identical runs of the same program have their shared libraries loaded at
1448 different locations, as a security measure. This also perturbs the
1449 results.
1450 </para>
1451
1452 </sect2>
1453
1454 </sect1>
1455
1456
1457
1458 <sect1 id="cg-manual.impl-details"
1459        xreflabel="Implementation Details">
1460 <title>Implementation Details</title>
1461 <para>
This section covers details that you don't need to know in order to
use Cachegrind, but that may be of interest to some people.
1464 </para>
1465
1466 <sect2 id="cg-manual.impl-details.how-cg-works"
1467        xreflabel="How Cachegrind Works">
1468 <title>How Cachegrind Works</title>
1469 <para>The best reference for understanding how Cachegrind works is chapter 3 of
1470 "Dynamic Binary Analysis and Instrumentation", by Nicholas Nethercote.  It
1471 is available on the <ulink url="&vg-pubs-url;">Valgrind publications
1472 page</ulink>.</para>
1473 </sect2>
1474
1475 <sect2 id="cg-manual.impl-details.file-format"
1476        xreflabel="Cachegrind Output File Format">
1477 <title>Cachegrind Output File Format</title>
1478 <para>The file format is fairly straightforward, basically giving the
1479 cost centre for every line, grouped by files and
1480 functions.  It's also totally generic and self-describing, in the sense that
1481 it can be used for any events that can be counted on a line-by-line basis,
1482 not just cache and branch predictor events.  For example, earlier versions
1483 of Cachegrind didn't have a branch predictor simulation.  When this was
1484 added, the file format didn't need to change at all.  So the format (and
1485 consequently, cg_annotate) could be used by other tools.</para>
1486
1487 <para>The file format:</para>
1488 <programlisting><![CDATA[
1489 file         ::= desc_line* cmd_line events_line data_line+ summary_line
1490 desc_line    ::= "desc:" ws? non_nl_string
1491 cmd_line     ::= "cmd:" ws? cmd
1492 events_line  ::= "events:" ws? (event ws)+
1493 data_line    ::= file_line | fn_line | count_line
1494 file_line    ::= "fl=" filename
1495 fn_line      ::= "fn=" fn_name
1496 count_line   ::= line_num (ws+ count)* ws*
1497 summary_line ::= "summary:" ws? count (ws+ count)+ ws*
1498 count        ::= num]]></programlisting>
1499
1500 <para>Where:</para>
1501 <itemizedlist>
1502   <listitem>
1503     <para><computeroutput>non_nl_string</computeroutput> is any
1504     string not containing a newline.</para>
1505   </listitem>
1506   <listitem>
1507     <para><computeroutput>cmd</computeroutput> is a string holding the
1508     command line of the profiled program.</para>
1509   </listitem>
1510   <listitem>
1511     <para><computeroutput>event</computeroutput> is a string containing
1512     no whitespace.</para>
1513   </listitem>
1514   <listitem>
1515     <para><computeroutput>filename</computeroutput> and
1516     <computeroutput>fn_name</computeroutput> are strings.</para>
1517   </listitem>
1518   <listitem>
1519     <para><computeroutput>num</computeroutput> and
1520     <computeroutput>line_num</computeroutput> are decimal
1521     numbers.</para>
1522   </listitem>
1523   <listitem>
1524     <para><computeroutput>ws</computeroutput> is whitespace.</para>
1525   </listitem>
1526 </itemizedlist>
1527
1528 <para>The contents of the "desc:" lines are printed out at the top
1529 of the summary.  This is a generic way of providing simulation
1530 specific information, e.g. for giving the cache configuration for
1531 cache simulation.</para>
1532
1533 <para>More than one line of info can be present for each file/fn/line number.
1534 In such cases, the counts for the named events will be accumulated.</para>
1535
<para>The number of counts in each
<computeroutput>count_line</computeroutput> and the
<computeroutput>summary_line</computeroutput> should not exceed
the number of events in the
<computeroutput>events_line</computeroutput>.  If the number in
a <computeroutput>count_line</computeroutput> is less, cg_annotate
treats the missing counts as though they were "0" entries. This can reduce
file size.
1544 </para>
1545
1546 <para>A <computeroutput>file_line</computeroutput> changes the
1547 current file name.  A <computeroutput>fn_line</computeroutput>
1548 changes the current function name.  A
1549 <computeroutput>count_line</computeroutput> contains counts that
pertain to the current filename/fn_name.  A
<computeroutput>file_line</computeroutput> and a
<computeroutput>fn_line</computeroutput> must appear before any
<computeroutput>count_line</computeroutput> to give the context
of the first <computeroutput>count_line</computeroutput>.</para>
1555
1556 <para>Similarly, each <computeroutput>file_line</computeroutput> must be
1557 immediately followed by a <computeroutput>fn_line</computeroutput>.
1558 </para>
1559
1560 <para>The summary line is redundant, because it just holds the total counts
1561 for each event.  But this serves as a useful sanity check of the data;  if
1562 the totals for each event don't match the summary line, something has gone
1563 wrong.</para>
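
<para>
To make the format concrete, here is a minimal Python reader that follows the
grammar above, accumulates repeated entries, pads missing counts with zeros,
and compares the computed totals against the summary line. It is a sketch, not
the parser used by cg_annotate; the function name
<computeroutput>parse_cachegrind_out</computeroutput> and the sample data are
made up.
</para>

<programlisting><![CDATA[
def parse_cachegrind_out(text):
    """Return (events, per-line counts, computed totals, summary counts)."""
    events, summary, counts = [], [], {}
    cur_file = cur_fn = None
    for line in text.splitlines():
        if not line.strip() or line.startswith(("desc:", "cmd:")):
            continue
        if line.startswith("events:"):
            events = line[len("events:"):].split()
        elif line.startswith("summary:"):
            summary = [int(n) for n in line[len("summary:"):].split()]
        elif line.startswith("fl="):
            cur_file = line[3:]
        elif line.startswith("fn="):
            cur_fn = line[3:]
        else:
            # count_line: a line number followed by zero or more counts.
            fields = [int(n) for n in line.split()]
            nums = fields[1:] + [0] * (len(events) - len(fields) + 1)
            key = (cur_file, cur_fn, fields[0])
            old = counts.get(key, [0] * len(events))
            counts[key] = [a + b for a, b in zip(old, nums)]
    totals = [sum(c[i] for c in counts.values()) for i in range(len(events))]
    return events, counts, totals, summary


# Made-up sample file, for illustration only.
SAMPLE = """\
desc: made-up sample data
cmd: ./a.out
events: Ir I1mr ILmr
fl=main.c
fn=main
10 100 2 1
11 50
fl=util.c
fn=helper
3 25 1
fl=main.c
fn=main
10 5
summary: 180 3 1
"""
events, counts, totals, summary = parse_cachegrind_out(SAMPLE)
print("summary matches:", totals == summary)
]]></programlisting>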
1564
1565 </sect2>
1566
1567 </sect1>
1568 </chapter>