llvm/docs/CommandGuide/llvm-mca.rst

   1 llvm-mca - LLVM Machine Code Analyzer
   2 =====================================
   3
   4 .. program:: llvm-mca
   5
   6 SYNOPSIS
   7 --------
   8
   9 :program:`llvm-mca` [*options*] [input]
  10
  11 DESCRIPTION
  12 -----------
  13
  14 :program:`llvm-mca` is a performance analysis tool that uses information
  15 available in LLVM (e.g. scheduling models) to statically measure the performance
  16 of machine code on a specific CPU.
  17
  18 Performance is measured in terms of throughput as well as processor resource
  19 consumption. The tool currently works for processors with a backend for which
  20 there is a scheduling model available in LLVM.
  21
  22 The main goal of this tool is not just to predict the performance of the code
  23 when run on the target, but also to help diagnose potential performance
  24 issues.
  25
  26 Given an assembly code sequence, :program:`llvm-mca` estimates the Instructions
  27 Per Cycle (IPC), as well as hardware resource pressure. The analysis and
  28 reporting style were inspired by the IACA tool from Intel.
  29
  30 For example, you can compile code with clang, output assembly, and pipe it
  31 directly into :program:`llvm-mca` for analysis:
  32
  33 .. code-block:: bash
  34
  35   $ clang foo.c -O2 --target=x86_64 -S -o - | llvm-mca -mcpu=btver2
  36
  37 Or for Intel syntax:
  38
  39 .. code-block:: bash
  40
  41   $ clang foo.c -O2 --target=x86_64 -masm=intel -S -o - | llvm-mca -mcpu=btver2
  42
  43 (:program:`llvm-mca` detects Intel syntax by the presence of an `.intel_syntax`
  44 directive at the beginning of the input.  By default its output syntax matches
  45 that of its input.)
  46
  47 Scheduling models are not just used to compute instruction latencies and
  48 throughput, but also to understand what processor resources are available
  49 and how to simulate them.
  50
  51 By design, the quality of the analysis conducted by :program:`llvm-mca` is
  52 inevitably affected by the quality of the scheduling models in LLVM.
  53
  54 If you see that the performance report is not accurate for a processor,
  55 please `file a bug <https://github.com/llvm/llvm-project/issues>`_
  56 against the appropriate backend.
  57
  58 OPTIONS
  59 -------
  60
  61 If ``input`` is "``-``" or omitted, :program:`llvm-mca` reads from standard
  62 input. Otherwise, it will read from the specified filename.
  63
  64 If the :option:`-o` option is omitted, then :program:`llvm-mca` will send its output
  65 to standard output if the input is from standard input.  If the :option:`-o`
  66 option specifies "``-``", then the output will also be sent to standard output.
  67
  68
  69 .. option:: -help
  70
  71  Print a summary of command line options.
  72
  73 .. option:: -o <filename>
  74
  75  Use ``<filename>`` as the output filename. See the summary above for more
  76  details.
  77
  78 .. option:: -mtriple=<target triple>
  79
  80  Specify a target triple string.
  81
  82 .. option:: -march=<arch>
  83
  84  Specify the architecture for which to analyze the code. It defaults to the
  85  host default target.
  86
  87 .. option:: -mcpu=<cpuname>
  88
  89   Specify the processor for which to analyze the code.  By default, the cpu name
  90   is autodetected from the host.
  91
  92 .. option:: -output-asm-variant=<variant id>
  93
  94  Specify the output assembly variant for the report generated by the tool.
  95  On x86, possible values are [0, 1]. A value of 0 selects the AT&T assembly
  96  format, and a value of 1 selects the Intel assembly format, for the code
  97  printed out by the tool in the analysis report.
  98
  99 .. option:: -print-imm-hex
 100
 101  Prefer hex format for numeric literals in the output assembly printed as part
 102  of the report.
 103
 104 .. option:: -dispatch=<width>
 105
 106  Specify a different dispatch width for the processor. The dispatch width
 107  defaults to field 'IssueWidth' in the processor scheduling model.  If width is
 108  zero, then the default dispatch width is used.
 109
 110 .. option:: -register-file-size=<size>
 111
 112  Specify the size of the register file. When specified, this flag limits how
 113  many physical registers are available for register renaming purposes. A value
 114  of zero for this flag means "unlimited number of physical registers".
 115
 116 .. option:: -iterations=<number of iterations>
 117
 118  Specify the number of iterations to run. If this flag is set to 0, then the
 119  tool sets the number of iterations to a default value (i.e. 100).
 120
 121 .. option:: -noalias=<bool>
 122
 123   If set, the tool assumes that loads and stores don't alias. This is the
 124   default behavior.
 125
 126 .. option:: -lqueue=<load queue size>
 127
 128   Specify the size of the load queue in the load/store unit emulated by the tool.
 129   By default, the tool assumes an unbounded number of entries in the load queue.
 130   A value of zero for this flag is ignored, and the default load queue size is
 131   used instead.
 132
 133 .. option:: -squeue=<store queue size>
 134
 135   Specify the size of the store queue in the load/store unit emulated by the
 136   tool. By default, the tool assumes an unbounded number of entries in the store
 137   queue. A value of zero for this flag is ignored, and the default store queue
 138   size is used instead.
 139
 140 .. option:: -timeline
 141
 142   Enable the timeline view.
 143
 144 .. option:: -timeline-max-iterations=<iterations>
 145
 146   Limit the number of iterations to print in the timeline view. By default, the
 147   timeline view prints information for up to 10 iterations.
 148
 149 .. option:: -timeline-max-cycles=<cycles>
 150
 151   Limit the number of cycles in the timeline view, or use 0 for no limit. By
 152   default, the number of cycles is set to 80.
 153
 154 .. option:: -resource-pressure
 155
 156   Enable the resource pressure view. This is enabled by default.
 157
 158 .. option:: -register-file-stats
 159
 160   Enable register file usage statistics.
 161
 162 .. option:: -dispatch-stats
 163
 164   Enable extra dispatch statistics. This view collects and analyzes instruction
 165   dispatch events, as well as static/dynamic dispatch stall events. This view
 166   is disabled by default.
 167
 168 .. option:: -scheduler-stats
 169
 170   Enable extra scheduler statistics. This view collects and analyzes instruction
 171   issue events. This view is disabled by default.
 172
 173 .. option:: -retire-stats
 174
 175   Enable extra retire control unit statistics. This view is disabled by default.
 176
 177 .. option:: -instruction-info
 178
 179   Enable the instruction info view. This is enabled by default.
 180
 181 .. option:: -show-encoding
 182
 183   Enable the printing of instruction encodings within the instruction info view.
 184
 185 .. option:: -show-barriers
 186
 187   Enable the printing of LoadBarrier and StoreBarrier flags within the
 188   instruction info view.
 189
 190 .. option:: -all-stats
 191
 192   Print all hardware statistics. This enables extra statistics related to the
 193   dispatch logic, the hardware schedulers, the register file(s), and the retire
 194   control unit. This option is disabled by default.
 195
 196 .. option:: -all-views
 197
 198   Enable all the views.
 199
 200 .. option:: -instruction-tables
 201
 202   Prints resource pressure information based on the static information
 203   available from the processor model. This differs from the resource pressure
 204   view because it doesn't require that the code is simulated. It instead prints
 205   the theoretical uniform distribution of resource pressure for every
 206   instruction in sequence.
 207
 208 .. option:: -bottleneck-analysis
 209
 210   Print information about bottlenecks that affect the throughput. This analysis
 211   can be expensive, and it is disabled by default. Bottlenecks are highlighted
 212   in the summary view. Bottleneck analysis is currently not supported for
 213   processors with an in-order backend.
 214
 215 .. option:: -json
 216
 217   Print the requested views in valid JSON format. The instructions and the
 218   processor resources are printed as members of special top level JSON objects.
 219   The individual views refer to them by index. However, not all views are
 220   currently supported. For example, the report from the bottleneck analysis is
 221   not printed out in JSON. All the default views are currently supported.
 222
 223 .. option:: -disable-cb
 224
 225   Force usage of the generic CustomBehaviour and InstrPostProcess classes rather
 226   than using the target specific implementation. The generic classes never
 227   detect any custom hazards or make any post processing modifications to
 228   instructions.
 229
 230 .. option:: -disable-im
 231
 232   Force usage of the generic InstrumentManager rather than using the target
 233   specific implementation. The generic class creates Instruments that provide
 234   no extra information, and InstrumentManager never overrides the default
 235   schedule class for a given instruction.
 236
 237 EXIT STATUS
 238 -----------
 239
 240 :program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
 241 to standard error, and the tool returns 1.
 242
 243 USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
 244 ---------------------------------------------
 245 :program:`llvm-mca` allows for the optional usage of special code comments to
 246 mark regions of the assembly code to be analyzed.  A comment starting with
 247 substring ``LLVM-MCA-BEGIN`` marks the beginning of an analysis region. A
 248 comment starting with substring ``LLVM-MCA-END`` marks the end of a region.
 249 For example:
 250
 251 .. code-block:: none
 252
 253   # LLVM-MCA-BEGIN
 254     ...
 255   # LLVM-MCA-END
 256
 257 If no user-defined region is specified, then :program:`llvm-mca` assumes a
 258 default region which contains every instruction in the input file.  Every region
 259 is analyzed in isolation, and the final performance report is the union of all
 260 the reports generated for every analysis region.
 261
 262 Analysis regions can have names. For example:
 263
 264 .. code-block:: none
 265
 266   # LLVM-MCA-BEGIN A simple example
 267     add %eax, %eax
 268   # LLVM-MCA-END
 269
 270 The code from the example above defines a region named "A simple example" with a
 271 single instruction in it. Note how the region name doesn't have to be repeated
 272 in the ``LLVM-MCA-END`` directive. In the absence of overlapping regions,
 273 an anonymous ``LLVM-MCA-END`` directive always ends the currently active
 274 user-defined region.
 275
 276 Example of nesting regions:
 277
 278 .. code-block:: none
 279
 280   # LLVM-MCA-BEGIN foo
 281     add %eax, %edx
 282   # LLVM-MCA-BEGIN bar
 283     sub %eax, %edx
 284   # LLVM-MCA-END bar
 285   # LLVM-MCA-END foo
 286
 287 Example of overlapping regions:
 288
 289 .. code-block:: none
 290
 291   # LLVM-MCA-BEGIN foo
 292     add %eax, %edx
 293   # LLVM-MCA-BEGIN bar
 294     sub %eax, %edx
 295   # LLVM-MCA-END foo
 296     add %eax, %edx
 297   # LLVM-MCA-END bar
 298
 299 Note that multiple anonymous regions cannot overlap. Also, overlapping regions
 300 cannot have the same name.
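
     The marker rules above can be illustrated with a small Python sketch. It is
     not how :program:`llvm-mca` parses its input: the helper name
     ``collect_regions`` is hypothetical, and the sketch assumes that ``#`` starts
     a comment (as in the AT&T-syntax x86 examples above) and that every other
     non-empty line is an instruction.

     .. code-block:: python

       def collect_regions(asm_text):
           """Map each region name to the instructions it contains."""
           active = {}    # regions that have begun but not yet ended
           finished = {}  # regions closed by an LLVM-MCA-END comment
           for raw_line in asm_text.splitlines():
               line = raw_line.strip()
               if not line:
                   continue
               if line.startswith("#"):
                   comment = line.lstrip("#").strip()
                   if comment.startswith("LLVM-MCA-BEGIN"):
                       name = comment[len("LLVM-MCA-BEGIN"):].strip()
                       if name in active:
                           # Covers two overlapping anonymous regions and
                           # overlapping regions that share a name.
                           raise ValueError(f"region {name!r} is already active")
                       active[name] = []
                   elif comment.startswith("LLVM-MCA-END"):
                       name = comment[len("LLVM-MCA-END"):].strip()
                       if not name and len(active) == 1:
                           # An anonymous END closes the only active region.
                           name = next(iter(active))
                       if name not in active:
                           raise ValueError(f"no active region matches {name!r}")
                       finished[name] = active.pop(name)
                   continue
               # An instruction belongs to every region that is currently active.
               for instructions in active.values():
                   instructions.append(line)
           return finished


       overlapping = """
       # LLVM-MCA-BEGIN foo
         add %eax, %edx
       # LLVM-MCA-BEGIN bar
         sub %eax, %edx
       # LLVM-MCA-END foo
         add %eax, %edx
       # LLVM-MCA-END bar
       """
       regions = collect_regions(overlapping)
       print(regions["foo"])  # ['add %eax, %edx', 'sub %eax, %edx']
       print(regions["bar"])  # ['sub %eax, %edx', 'add %eax, %edx']

     In the overlapping example, the ``sub`` instruction is analyzed as part of
     both ``foo`` and ``bar``, since each region is analyzed in isolation.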
 301
 302 There is no support for marking regions from high-level source code, like C or
 303 C++. As a workaround, inline assembly directives may be used:
 304
 305 .. code-block:: c++
 306
 307   int foo(int a, int b) {
 308     __asm volatile("# LLVM-MCA-BEGIN foo":::"memory");
 309     a += 42;
 310     __asm volatile("# LLVM-MCA-END":::"memory");
 311     a *= b;
 312     return a;
 313   }
 314
 315 However, this interferes with optimizations like loop vectorization and may have
 316 an impact on the code generated. This is because the ``__asm`` statements are
 317 seen as real code having important side effects, which limits how the code
 318 around them can be transformed. If users want to make use of inline assembly
 319 to emit markers, then the recommendation is to always verify that the output
 320 assembly is equivalent to the assembly generated in the absence of markers.
 321 The `Clang options to emit optimization reports <https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports>`_
 322 can also help in detecting missed optimizations.
 323
 324 INSTRUMENT REGIONS
 325 ------------------
 326
 327 An InstrumentRegion describes a region of assembly code guarded by
 328 special LLVM-MCA comment directives.
 329
 330 .. code-block:: none
 331
 332   # LLVM-MCA-<INSTRUMENT_TYPE> <data>
 333     ...  ## asm
 334
 335 where `INSTRUMENT_TYPE` is a type defined by the target and expects
 336 to use `data`.
 337
 338 A comment starting with substring `LLVM-MCA-<INSTRUMENT_TYPE>`
 339 brings data into scope for llvm-mca to use in its analysis for
 340 all following instructions.
 341
 342 If a comment with the same `INSTRUMENT_TYPE` is found later in the
 343 instruction list, then the original InstrumentRegion will be
 344 automatically ended, and a new InstrumentRegion will begin.
 345
 346 If there are comments with different `INSTRUMENT_TYPE` values,
 347 then both data sets remain available. In contrast with an AnalysisRegion,
 348 an InstrumentRegion does not need a comment to end the region.
 349
 350 Comments that are prefixed with `LLVM-MCA-` but do not correspond to
 351 a valid `INSTRUMENT_TYPE` for the target cause an error, except for
 352 `BEGIN` and `END`, since those correspond to AnalysisRegions. Comments
 353 that do not start with `LLVM-MCA-` are ignored by :program:`llvm-mca`.
 354
 355 An instruction (an MCInst) is added to an InstrumentRegion R only
 356 if its location is in range [R.RangeStart, R.RangeEnd].
 357
 358 On RISCV targets, vector instructions have different behaviour depending
 359 on the LMUL. Code can be instrumented with a comment that takes the
 360 following form:
 361
 362 .. code-block:: none
 363
 364   # LLVM-MCA-RISCV-LMUL <M1|M2|M4|M8|MF2|MF4|MF8>
 365
 366 The RISCV InstrumentManager will override the schedule class for vector
 367 instructions to use the scheduling behaviour of their pseudo-instructions,
 368 which are LMUL dependent. It makes sense to place RISCV instrument
 369 comments directly after `vset{i}vl{i}` instructions, although
 370 they can be placed anywhere in the program.
 371
 372 Example of program with no call to `vset{i}vl{i}`:
 373
 374 .. code-block:: none
 375
 376   # LLVM-MCA-RISCV-LMUL M2
 377   vadd.vv v2, v2, v2
 378
 379 Example of program with call to `vset{i}vl{i}`:
 380
 381 .. code-block:: none
 382
 383   vsetvli zero, a0, e8, m1, tu, mu
 384   # LLVM-MCA-RISCV-LMUL M1
 385   vadd.vv v2, v2, v2
 386
 387 Example of program with multiple calls to `vset{i}vl{i}`:
 388
 389 .. code-block:: none
 390
 391   vsetvli zero, a0, e8, m1, tu, mu
 392   # LLVM-MCA-RISCV-LMUL M1
 393   vadd.vv v2, v2, v2
 394   vsetvli zero, a0, e8, m8, tu, mu
 395   # LLVM-MCA-RISCV-LMUL M8
 396   vadd.vv v2, v2, v2
 397
 398 Example of program with call to `vsetvl`:
 399
 400 .. code-block:: none
 401
 402  vsetvl rd, rs1, rs2
 403  # LLVM-MCA-RISCV-LMUL M1
 404  vadd.vv v12, v12, v12
 405  vsetvl rd, rs1, rs2
 406  # LLVM-MCA-RISCV-LMUL M4
 407  vadd.vv v12, v12, v12
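
     The scoping rule for instrument comments can be sketched in Python. This is
     an illustration only, not the RISCV InstrumentManager: the helper name
     ``lmul_in_scope`` is hypothetical, ``#`` is assumed to start a comment, and
     ``None`` stands for "no LMUL instrument in scope yet".

     .. code-block:: python

       def lmul_in_scope(asm_text):
           """Pair each instruction with the LMUL instrument in scope for it."""
           prefix = "LLVM-MCA-RISCV-LMUL"
           current = None  # no instrument comment seen yet
           result = []
           for raw_line in asm_text.splitlines():
               line = raw_line.strip()
               if not line:
                   continue
               if line.startswith("#"):
                   comment = line.lstrip("#").strip()
                   if comment.startswith(prefix):
                       # A later comment of the same type ends the previous
                       # region and starts a new one.
                       current = comment[len(prefix):].strip()
                   continue
               result.append((line, current))
           return result


       program = """
       vsetvli zero, a0, e8, m1, tu, mu
       # LLVM-MCA-RISCV-LMUL M1
       vadd.vv v2, v2, v2
       vsetvli zero, a0, e8, m8, tu, mu
       # LLVM-MCA-RISCV-LMUL M8
       vadd.vv v2, v2, v2
       """
       for instruction, lmul in lmul_in_scope(program):
           print(lmul, instruction)

     With the comments placed directly after each ``vsetvli``, the first
     ``vadd.vv`` is paired with ``M1`` and the second with ``M8``. The second
     ``vsetvli`` itself still falls in the ``M1`` region, because the new region
     only starts at the comment that follows it.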
 408
 409 HOW LLVM-MCA WORKS
 410 ------------------
 411
 412 :program:`llvm-mca` takes assembly code as input. The assembly code is parsed
 413 into a sequence of MCInst with the help of the existing LLVM target assembly
 414 parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
 415 to generate a performance report.
 416
 417 The Pipeline module simulates the execution of the machine code sequence in a
 418 loop of iterations (default is 100). During this process, the pipeline collects
 419 a number of execution related statistics. At the end of this process, the
 420 pipeline generates and prints a report from the collected statistics.
 421
 422 Here is an example of a performance report generated by the tool for a
 423 dot-product of two packed float vectors of four elements. The analysis is
 424 conducted for target x86, cpu btver2.  This report can be produced with the
 425 following command, using the example located at
 426 ``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:
 427
 428 .. code-block:: bash
 429
 430   $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s
 431
 432 .. code-block:: none
 433
 434   Iterations:        300
 435   Instructions:      900
 436   Total Cycles:      610
 437   Total uOps:        900
 438
 439   Dispatch Width:    2
 440   uOps Per Cycle:    1.48
 441   IPC:               1.48
 442   Block RThroughput: 2.0
 443
 444
 445   Instruction Info:
 446   [1]: #uOps
 447   [2]: Latency
 448   [3]: RThroughput
 449   [4]: MayLoad
 450   [5]: MayStore
 451   [6]: HasSideEffects (U)
 452
 453   [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 454    1      2     1.00                        vmulps      %xmm0, %xmm1, %xmm2
 455    1      3     1.00                        vhaddps     %xmm2, %xmm2, %xmm3
 456    1      3     1.00                        vhaddps     %xmm3, %xmm3, %xmm4
 457
 458
 459   Resources:
 460   [0]   - JALU0
 461   [1]   - JALU1
 462   [2]   - JDiv
 463   [3]   - JFPA
 464   [4]   - JFPM
 465   [5]   - JFPU0
 466   [6]   - JFPU1
 467   [7]   - JLAGU
 468   [8]   - JMul
 469   [9]   - JSAGU
 470   [10]  - JSTC
 471   [11]  - JVALU0
 472   [12]  - JVALU1
 473   [13]  - JVIMUL
 474
 475
 476   Resource pressure per iteration:
 477   [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
 478    -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -
 479
 480   Resource pressure by instruction:
 481   [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
 482    -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps      %xmm0, %xmm1, %xmm2
 483    -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm2, %xmm2, %xmm3
 484    -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm3, %xmm3, %xmm4
 485
 486 According to this report, the dot-product kernel has been executed 300 times,
 487 for a total of 900 simulated instructions. The total number of simulated micro
 488 opcodes (uOps) is also 900.
 489
 490 The report is structured in three main sections.  The first section collects a
 491 few performance numbers; the goal of this section is to give a very quick
 492 overview of the performance throughput. Important performance indicators are
 493 **IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
 494 Throughput).
 495
 496 Field *DispatchWidth* is the maximum number of micro opcodes that are dispatched
 497 to the out-of-order backend every simulated cycle. For processors with an
 498 in-order backend, *DispatchWidth* is the maximum number of micro opcodes issued
 499 to the backend every simulated cycle.
 500
 501 IPC is computed by dividing the total number of simulated instructions by the total
 502 number of cycles.
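
     Using the numbers from the summary of the dot-product report, the division
     can be reproduced in a few lines of Python. The variable names are chosen
     for this illustration only.

     .. code-block:: python

       instructions = 900  # "Instructions" field of the summary
       total_uops = 900    # "Total uOps" field of the summary
       total_cycles = 610  # "Total Cycles" field of the summary

       ipc = instructions / total_cycles
       uops_per_cycle = total_uops / total_cycles

       print(f"IPC:            {ipc:.2f}")             # 1.48
       print(f"uOps Per Cycle: {uops_per_cycle:.2f}")  # 1.48

     Both values match because every instruction in this kernel decodes to a
     single micro opcode.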
 503
 504 Field *Block RThroughput* is the reciprocal of the block throughput. Block
 505 throughput is a theoretical quantity computed as the maximum number of blocks
 506 (i.e. iterations) that can be executed per simulated clock cycle in the absence
 507 of loop-carried dependencies. Block throughput is bounded from above by the
 508 dispatch rate and by the availability of hardware resources.
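
     One way to read this definition is as the larger of two per-iteration
     bounds: the cycles needed to dispatch one block, and the cycles for which
     the busiest resource is occupied. The Python sketch below applies that
     reading to the dot-product report above. It is an assumption about how the
     two bounds combine, not a transcription of the tool's implementation, and it
     further assumes that every listed resource has a single unit. The function
     name is hypothetical.

     .. code-block:: python

       def block_rthroughput(uops_per_iteration, dispatch_width, resource_cycles):
           """Estimate the reciprocal block throughput in cycles per iteration."""
           # Cycles needed just to dispatch all micro opcodes of one iteration.
           dispatch_bound = uops_per_iteration / dispatch_width
           # Cycles for which the most heavily used resource is busy.
           resource_bound = max(resource_cycles.values(), default=0.0)
           return max(dispatch_bound, resource_bound)


       # "Resource pressure per iteration" row of the report (non-empty cells).
       pressure = {"JFPA": 2.00, "JFPM": 1.00, "JFPU0": 2.00, "JFPU1": 1.00}

       estimate = block_rthroughput(3, 2, pressure)
       print(estimate)  # 2.0

     Here the dispatch bound is 1.5 cycles and the resource bound is 2.0 cycles,
     so the resources are the limiting factor. The result agrees with the
     ``Block RThroughput`` value printed in the report.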
 509
 510 In the absence of loop-carried data dependencies, the observed IPC tends to a
 511 theoretical maximum which can be computed by dividing the number of instructions
 512 of a single iteration by the `Block RThroughput`.
 513
 514 Field 'uOps Per Cycle' is computed by dividing the total number of simulated micro
 515 opcodes by the total number of cycles. A delta between Dispatch Width and this
 516 field is an indicator of a performance issue. In the absence of loop-carried
 517 data dependencies, the observed 'uOps Per Cycle' should tend to a theoretical
 518 maximum throughput which can be computed by dividing the number of uOps of a
 519 single iteration by the `Block RThroughput`.
 520
 521 Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
 522 because the dispatch width limits the maximum size of a dispatch group. Both IPC
 523 and 'uOps Per Cycle' are limited by the amount of hardware parallelism. The
 524 availability of hardware resources affects the resource pressure distribution,
 525 and it limits the number of instructions that can be executed in parallel every
 526 cycle.  A delta between Dispatch Width and the theoretical maximum uOps per
 527 Cycle (computed by dividing the number of uOps of a single iteration by the
 528 `Block RThroughput`) is an indicator of a performance bottleneck caused by the
 529 lack of hardware resources.
 530 In general, the lower the Block RThroughput, the better.
 531
 532 In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
 533 are no loop-carried dependencies, the observed `uOps Per Cycle` is expected to
 534 approach 1.50 when the number of iterations tends to infinity. The delta between
 535 the Dispatch Width (2.00) and the theoretical maximum uOp throughput (1.50) is
 536 an indicator of a performance bottleneck caused by the lack of hardware
 537 resources, and the *Resource pressure view* can help to identify the problematic
 538 resource usage.
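
     The arithmetic behind these figures is short enough to spell out. The
     values are taken from the dot-product report; the variable names are only
     for illustration.

     .. code-block:: python

       uops_per_iteration = 3   # one uOp per instruction, three instructions
       block_rthroughput = 2.0  # "Block RThroughput" field of the summary
       dispatch_width = 2       # "Dispatch Width" field of the summary

       max_uops_per_cycle = uops_per_iteration / block_rthroughput
       delta = dispatch_width - max_uops_per_cycle

       print(max_uops_per_cycle)  # 1.5
       print(delta)               # 0.5

     The observed value after 300 iterations (1.48) is already close to this
     limit.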
 539
 540 The second section of the report is the `instruction info view`. It shows the
 541 latency and reciprocal throughput of every instruction in the sequence. It also
 542 reports extra information related to the number of micro opcodes, and opcode
 543 properties (i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').
 544
 545 Field *RThroughput* is the reciprocal of the instruction throughput. Throughput
 546 is computed as the maximum number of instructions of a same type that can be
 547 executed per clock cycle in the absence of operand dependencies. In this
 548 example, the reciprocal throughput of a vector float multiply is 1
 549 cycle/instruction.  That is because the FP multiplier JFPM is only available
 550 from pipeline JFPU1.
 551
 552 Instruction encodings are displayed within the instruction info view when flag
 553 `-show-encoding` is specified.
 554
 555 Below is an example of `-show-encoding` output for the dot-product kernel:
 556
 557 .. code-block:: none
 558
 559   Instruction Info:
 560   [1]: #uOps
 561   [2]: Latency
 562   [3]: RThroughput
 563   [4]: MayLoad
 564   [5]: MayStore
 565   [6]: HasSideEffects (U)
 566   [7]: Encoding Size
 567
 568   [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:                    Instructions:
 569    1      2     1.00                         4     c5 f0 59 d0                   vmulps      %xmm0, %xmm1, %xmm2
 570    1      4     1.00                         4     c5 eb 7c da                   vhaddps     %xmm2, %xmm2, %xmm3
 571    1      4     1.00                         4     c5 e3 7c e3                   vhaddps     %xmm3, %xmm3, %xmm4
 572
 573 The `Encoding Size` column shows the size in bytes of instructions.  The
 574 `Encodings` column shows the actual instruction encodings (byte sequences in
 575 hex).
 576
 577 The third section is the *Resource pressure view*.  This view reports
 578 the average number of resource cycles consumed every iteration by instructions
 579 for every processor resource unit available on the target.  Information is
 580 structured in two tables. The first table reports the number of resource cycles
 581 spent on average every iteration. The second table correlates the resource
 582 cycles to the machine instruction in the sequence. For example, every iteration
 583 of the instruction vmulps always executes on resource unit [6]
 584 (JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
 585 per iteration.  Note that on AMD Jaguar, vector floating-point multiply can
 586 only be issued to pipeline JFPU1, while horizontal floating-point additions can
 587 only be issued to pipeline JFPU0.
 588
 589 The resource pressure view helps with identifying bottlenecks caused by high
 590 usage of specific hardware resources.  Situations with resource pressure mainly
 591 concentrated on a few resources should, in general, be avoided.  Ideally,
 592 pressure should be uniformly distributed between multiple resources.
 593
 594 Timeline View
 595 ^^^^^^^^^^^^^
 596 The timeline view produces a detailed report of each instruction's state
 597 transitions through an instruction pipeline.  This view is enabled by the
 598 command line option ``-timeline``.  As instructions transition through the
 599 various stages of the pipeline, their states are depicted in the view report.
 600 These states are represented by the following characters:
 601
 602 * D : Instruction dispatched.
 603 * e : Instruction executing.
 604 * E : Instruction executed.
 605 * R : Instruction retired.
 606 * = : Instruction already dispatched, waiting to be executed.
 607 * \- : Instruction executed, waiting to be retired.
 608
 609 Below is the timeline view for a subset of the dot-product example located in
 610 ``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
 611 :program:`llvm-mca` using the following command:
 612
 613 .. code-block:: bash
 614
 615   $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s
 616
 617 .. code-block:: none
 618
 619   Timeline view:
 620                       012345
 621   Index     0123456789
 622
 623   [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
 624   [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
 625   [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
 626   [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
 627   [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
 628   [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
 629   [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
 630   [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
 631   [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4
 632
 633
 634   Average Wait times (based on the timeline view):
 635   [0]: Executions
 636   [1]: Average time spent waiting in a scheduler's queue
 637   [2]: Average time spent waiting in a scheduler's queue while ready
 638   [3]: Average time elapsed from WB until retire stage
 639
 640         [0]    [1]    [2]    [3]
 641   0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
 642   1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
 643   2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4
 644          3     3.3    0.5    1.4       <total>
 645
 646 The timeline view is interesting because it shows instruction state changes
 647 during execution.  It also gives an idea of how the tool processes instructions
 648 executed on the target, and how their timing information might be calculated.
 649
 650 The timeline view is structured in two tables.  The first table shows
 651 instructions changing state over time (measured in cycles); the second table
 652 (named *Average Wait times*) reports useful timing statistics, which should
 653 help diagnose performance bottlenecks caused by long data dependencies and
 654 sub-optimal usage of hardware resources.
 655
 656 An instruction in the timeline view is identified by a pair of indices, where
 657 the first index identifies an iteration, and the second index is the
 658 instruction index (i.e., where it appears in the code sequence).  Since this
 659 example was generated using 3 iterations (``-iterations=3``), the iteration
 660 indices range from 0 to 2 inclusive.
 661
 662 Excluding the first and last columns, the remaining columns are in cycles.
 663 Cycles are numbered sequentially starting from 0.
 664
 665 From the example output above, we know the following:
 666
 667 * Instruction [1,0] was dispatched at cycle 1.
 668 * Instruction [1,0] started executing at cycle 2.
 669 * Instruction [1,0] reached the write back stage at cycle 4.
 670 * Instruction [1,0] was retired at cycle 10.
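
     These cycle numbers can be read off mechanically, because the character at
     position N of a timeline row describes cycle N. A minimal Python sketch
     follows. The helper name ``timeline_events`` is hypothetical, and the sketch
     assumes a row that contains all of ``D``, ``e``, ``E`` and ``R``.

     .. code-block:: python

       def timeline_events(row):
           """Return the cycle of each state change in one timeline row."""
           return {
               "dispatched": row.index("D"),
               "execution_started": row.index("e"),
               "executed": row.index("E"),  # write-back stage reached
               "retired": row.index("R"),
           }


       events = timeline_events(".DeeE-----R")  # row [1,0] of the example
       print(events)
       # {'dispatched': 1, 'execution_started': 2, 'executed': 4, 'retired': 10}

       # Cycles spent waiting between write-back and retirement ('-' cells).
       print(events["retired"] - events["executed"] - 1)  # 5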
 671
 672 Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
 673 scheduler's queue for the operands to become available. By the time vmulps is
 674 dispatched, operands are already available, and pipeline JFPU1 is ready to
 675 serve another instruction.  So the instruction can be immediately issued on the
 676 JFPU1 pipeline. That is demonstrated by the fact that the instruction only
 677 spent 1cy in the scheduler's queue.
 678
 679 There is a gap of 5 cycles between the write-back stage and the retire event.
 680 That is because instructions must retire in program order, so [1,0] has to wait
 681 for [0,2] to be retired first (i.e., it has to wait until cycle 10).
 682
 683 In the example, all instructions are in a RAW (Read After Write) dependency
 684 chain.  Register %xmm2 written by vmulps is immediately used by the first
 685 vhaddps, and register %xmm3 written by the first vhaddps is used by the second
 686 vhaddps.  Long data dependencies negatively impact the ILP (Instruction Level
 687 Parallelism).
 688
 689 In the dot-product example, there are anti-dependencies introduced by
 690 instructions from different iterations.  However, those dependencies can be
 691 removed at register renaming stage (at the cost of allocating register aliases,
 692 and therefore consuming physical registers).
 693
 694 Table *Average Wait times* helps diagnose performance issues that are caused by
 695 the presence of long latency instructions and potentially long data dependencies
 696 which may limit the ILP. Last row, ``<total>``, shows a global average over all
 697 instructions measured. Note that :program:`llvm-mca`, by default, assumes at
 698 least 1cy between the dispatch event and the issue event.
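
     Columns [1] and [3] of the table can be recomputed from the timeline rows
     of the example. The Python sketch below is inferred from the report rather
     than taken from the tool: it counts each ``=`` as one cycle in the
     scheduler's queue, adds the one cycle assumed between dispatch and issue,
     and counts each ``-`` as one cycle spent waiting to retire. Column [2]
     cannot be derived from the timeline characters alone, so it is left out.

     .. code-block:: python

       # Timeline rows per instruction, one entry per iteration, with the
       # leading '.' padding removed.
       timeline = {
           "vmulps": ["DeeER", "DeeE-----R", "DeeE-----R"],
           "vhaddps (first)": ["D==eeeER", "D=eeeE---R", "D====eeeER"],
           "vhaddps (second)": ["D====eeeER", "D====eeeER", "D======eeeER"],
       }


       def average(values):
           return sum(values) / len(values)


       results = {}
       for name, rows in timeline.items():
           # Column [1]: '=' cells plus the assumed 1cy dispatch-to-issue delay.
           queue_time = average([row.count("=") + 1 for row in rows])
           # Column [3]: '-' cells between write-back and retirement.
           retire_wait = average([row.count("-") for row in rows])
           results[name] = (f"{queue_time:.1f}", f"{retire_wait:.1f}")
           print(name, results[name])

     The values produced this way (1.0/3.3, 3.3/1.0 and 5.7/0.0) agree with
     columns [1] and [3] of the table above.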
 699
 700 When the performance is limited by data dependencies and/or long latency
 701 instructions, the number of cycles spent while in the *ready* state is expected
 702 to be very small when compared with the total number of cycles spent in the
 703 scheduler's queue.  The difference between the two counters is a good indicator
 704 of how large an impact data dependencies had on the execution of the
 705 instructions.  When performance is mostly limited by the lack of hardware
 706 resources, the delta between the two counters is small.  However, the number of
 707 cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
 708 especially when compared to other low latency instructions.
 709
 710 Bottleneck Analysis
 711 ^^^^^^^^^^^^^^^^^^^
 712 The ``-bottleneck-analysis`` command line option enables the analysis of
 713 performance bottlenecks.
 714
 715 This analysis is potentially expensive. It attempts to correlate increases in
 716 backend pressure (caused by pipeline resource pressure and data dependencies) to
 717 dynamic dispatch stalls.
 718
 719 Below is an example of ``-bottleneck-analysis`` output generated by
 720 :program:`llvm-mca` for 500 iterations of the dot-product example on btver2.
 721
 722 .. code-block:: none
 723
 724
 725   Cycles with backend pressure increase [ 48.07% ]
 726   Throughput Bottlenecks:
 727     Resource Pressure       [ 47.77% ]
 728     - JFPA  [ 47.77% ]
 729     - JFPU0  [ 47.77% ]
 730     Data Dependencies:      [ 0.30% ]
 731     - Register Dependencies [ 0.30% ]
 732     - Memory Dependencies   [ 0.00% ]
 733
 734   Critical sequence based on the simulation:
 735
 736                 Instruction                         Dependency Information
 737    +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
 738    |
 739    |    < loop carried >
 740    |
 741    |      0.    vmulps  %xmm0, %xmm1, %xmm2
 742    +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
 743    +----> 2.    vhaddps %xmm3, %xmm3, %xmm4         ## REGISTER dependency:  %xmm3
 744    |
 745    |    < loop carried >
 746    |
 747    +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
 748
 749
 750 According to the analysis, throughput is limited by resource pressure and not by
 751 data dependencies.  The analysis observed increases in backend pressure during
 752 48.07% of the simulated run. Almost all those pressure increase events were
 753 caused by contention on processor resources JFPA/JFPU0.
 754
The *critical sequence* is the most expensive sequence of instructions according
to the simulation. It is annotated to provide extra information about critical
register dependencies and resource interferences between instructions.
 758
Instructions from the critical sequence are expected to significantly impact
performance. By construction, the accuracy of this analysis depends strongly
on the simulation and (as always) on the quality of the processor model in
LLVM.
 763
 764 Bottleneck analysis is currently not supported for processors with an in-order
 765 backend.
 766
 767 Extra Statistics to Further Diagnose Performance Issues
 768 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 769 The ``-all-stats`` command line option enables extra statistics and performance
 770 counters for the dispatch logic, the reorder buffer, the retire control unit,
 771 and the register file.
 772
 773 Below is an example of ``-all-stats`` output generated by  :program:`llvm-mca`
 774 for 300 iterations of the dot-product example discussed in the previous
 775 sections.
 776
 777 .. code-block:: none
 778
 779   Dynamic Dispatch Stall Cycles:
 780   RAT     - Register unavailable:                      0
 781   RCU     - Retire tokens unavailable:                 0
 782   SCHEDQ  - Scheduler full:                            272  (44.6%)
 783   LQ      - Load queue full:                           0
 784   SQ      - Store queue full:                          0
 785   GROUP   - Static restrictions on the dispatch group: 0
 786
 787
 788   Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
 789   [# dispatched], [# cycles]
 790    0,              24  (3.9%)
 791    1,              272  (44.6%)
 792    2,              314  (51.5%)
 793
 794
 795   Schedulers - number of cycles where we saw N micro opcodes issued:
 796   [# issued], [# cycles]
 797    0,          7  (1.1%)
 798    1,          306  (50.2%)
 799    2,          297  (48.7%)
 800
 801   Scheduler's queue usage:
 802   [1] Resource name.
 803   [2] Average number of used buffer entries.
 804   [3] Maximum number of used buffer entries.
 805   [4] Total number of buffer entries.
 806
 807    [1]            [2]        [3]        [4]
 808   JALU01           0          0          20
 809   JFPU01           17         18         18
 810   JLSAGU           0          0          12
 811
 812
 813   Retire Control Unit - number of cycles where we saw N instructions retired:
 814   [# retired], [# cycles]
 815    0,           109  (17.9%)
 816    1,           102  (16.7%)
 817    2,           399  (65.4%)
 818
 819   Total ROB Entries:                64
 820   Max Used ROB Entries:             35  ( 54.7% )
 821   Average Used ROB Entries per cy:  32  ( 50.0% )
 822
 823
 824   Register File statistics:
 825   Total number of mappings created:    900
 826   Max number of mappings used:         35
 827
 828   *  Register File #1 -- JFpuPRF:
 829      Number of physical registers:     72
 830      Total number of mappings created: 900
 831      Max number of mappings used:      35
 832
 833   *  Register File #2 -- JIntegerPRF:
 834      Number of physical registers:     64
 835      Total number of mappings created: 0
 836      Max number of mappings used:      0
 837
 838 If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
 839 SCHEDQ reports 272 cycles.  This counter is incremented every time the dispatch
 840 logic is unable to dispatch a full group because the scheduler's queue is full.
 841
 842 Looking at the *Dispatch Logic* table, we see that the pipeline was only able to
 843 dispatch two micro opcodes 51.5% of the time.  The dispatch group was limited to
 844 one micro opcode 44.6% of the cycles, which corresponds to 272 cycles.  The
 845 dispatch statistics are displayed by either using the command option
 846 ``-all-stats`` or ``-dispatch-stats``.
 847
The next table, *Schedulers*, is a histogram of the number of micro opcodes
issued per cycle. In this case, of the 610 simulated cycles, a single micro
opcode was issued in 306 cycles (50.2%) and there were 7 cycles where no micro
opcodes were issued.
 852
The *Scheduler's queue usage* table shows the average and maximum number of
buffer entries (i.e., scheduler queue entries) used at runtime.  Resource JFPU01
reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
three schedulers:

* JALU01 - A scheduler for ALU instructions.
* JFPU01 - A scheduler for floating point operations.
* JLSAGU - A scheduler for address generation.
 861
 862 The dot-product is a kernel of three floating point instructions (a vector
 863 multiply followed by two horizontal adds).  That explains why only the floating
 864 point scheduler appears to be used.
 865
 866 A full scheduler queue is either caused by data dependency chains or by a
 867 sub-optimal usage of hardware resources.  Sometimes, resource pressure can be
 868 mitigated by rewriting the kernel using different instructions that consume
 869 different scheduler resources.  Schedulers with a small queue are less resilient
 870 to bottlenecks caused by the presence of long data dependencies.  The scheduler
 871 statistics are displayed by using the command option ``-all-stats`` or
 872 ``-scheduler-stats``.
 873
The next table, *Retire Control Unit*, is a histogram of the number of
instructions retired per cycle.  In this case, of the 610 simulated cycles, two
instructions were retired during the same cycle 399 times (65.4%) and there
were 109 cycles where no instructions were retired.  The retire statistics are
displayed by using the command option ``-all-stats`` or ``-retire-stats``.
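
The dispatch, issue and retire histograms can be cross-checked with a few lines
of arithmetic. The sketch below is only an illustration (it is not part of
:program:`llvm-mca`); the counts are copied from the ``-all-stats`` output
above and the helper names are made up for this example.

.. code-block:: python

  # Histograms copied from the -all-stats output above:
  # {events in a cycle: number of cycles with that many events}.
  dispatched = {0: 24, 1: 272, 2: 314}
  issued = {0: 7, 1: 306, 2: 297}
  retired = {0: 109, 1: 102, 2: 399}

  def total_cycles(hist):
      """Number of simulated cycles covered by a histogram."""
      return sum(hist.values())

  def total_events(hist):
      """Total micro opcodes (or instructions) counted by a histogram."""
      return sum(n * cycles for n, cycles in hist.items())

  def percentages(hist):
      """Share of cycles per bucket, rounded to one decimal like the report."""
      total = total_cycles(hist)
      return {n: round(100.0 * c / total, 1) for n, c in hist.items()}

  for name, hist in [("dispatched", dispatched),
                     ("issued", issued),
                     ("retired", retired)]:
      print(name, total_cycles(hist), total_events(hist), percentages(hist))

Each histogram covers the same 610 cycles and accounts for 900 events, which
matches 300 iterations of a three-instruction kernel in which every
instruction is a single micro opcode.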
 880
The last table presented is *Register File statistics*.  Each physical register
file (PRF) used by the pipeline is presented in this table.  In the case of AMD
Jaguar, there are two register files, one for floating-point registers (JFpuPRF)
and one for integer registers (JIntegerPRF).  The table shows that of the 900
instructions processed, there were 900 mappings created.  Since this dot-product
example utilized only floating point registers, the JFpuPRF was responsible for
creating the 900 mappings.  However, we see that the pipeline only used a
maximum of 35 of 72 available register slots at any given time. We can conclude
that the floating point PRF was the only register file used for the example, and
that it was never resource constrained.  The register file statistics are
displayed by using the command option ``-all-stats`` or
``-register-file-stats``.
 893
 894 In this example, we can conclude that the IPC is mostly limited by data
 895 dependencies, and not by resource pressure.
 896
 897 Instruction Flow
 898 ^^^^^^^^^^^^^^^^
 899 This section describes the instruction flow through the default pipeline of
 900 :program:`llvm-mca`, as well as the functional units involved in the process.
 901
 902 The default pipeline implements the following sequence of stages used to
 903 process instructions.
 904
 905 * Dispatch (Instruction is dispatched to the schedulers).
 906 * Issue (Instruction is issued to the processor pipelines).
 907 * Write Back (Instruction is executed, and results are written back).
 908 * Retire (Instruction is retired; writes are architecturally committed).
 909
The in-order pipeline implements the following sequence of stages:

* InOrderIssue (Instruction is issued to the processor pipelines).
* Retire (Instruction is retired; writes are architecturally committed).
 913
:program:`llvm-mca` assumes that instructions have all been decoded and placed
into a queue before the simulation starts. Therefore, the instruction fetch and
decode stages are not modeled. Performance bottlenecks in the frontend are not
diagnosed. Also, :program:`llvm-mca` does not model branch prediction.
 918
 919 Instruction Dispatch
 920 """"""""""""""""""""
 921 During the dispatch stage, instructions are picked in program order from a
 922 queue of already decoded instructions, and dispatched in groups to the
 923 simulated hardware schedulers.
 924
 925 The size of a dispatch group depends on the availability of the simulated
 926 hardware resources.  The processor dispatch width defaults to the value
 927 of the ``IssueWidth`` in LLVM's scheduling model.
 928
 929 An instruction can be dispatched if:
 930
 931 * The size of the dispatch group is smaller than processor's dispatch width.
 932 * There are enough entries in the reorder buffer.
 933 * There are enough physical registers to do register renaming.
 934 * The schedulers are not full.
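
The four conditions above can be read as a single predicate. The following is a
minimal sketch, not the actual :program:`llvm-mca` implementation: the ``Inst``
and ``DispatchState`` types, their fields, and ``can_dispatch`` are hypothetical
names introduced only for this illustration, and the sizes in the usage example
are taken from the btver2 statistics shown earlier.

.. code-block:: python

  from dataclasses import dataclass

  @dataclass
  class Inst:
      micro_ops: int    # reorder buffer entries the instruction consumes
      reg_writes: int   # physical registers needed for register renaming
      scheduler: str    # scheduler queue the instruction is buffered in

  @dataclass
  class DispatchState:
      dispatch_width: int
      group_size: int            # micro opcodes already dispatched this cycle
      free_rob_entries: int
      free_phys_regs: int
      free_sched_entries: dict   # scheduler name -> free queue entries

  def can_dispatch(inst, state):
      """Return True if all four dispatch conditions hold for `inst`."""
      return (
          state.group_size < state.dispatch_width
          and state.free_rob_entries >= inst.micro_ops
          and state.free_phys_regs >= inst.reg_writes
          and state.free_sched_entries[inst.scheduler] >= 1
      )

  vmulps = Inst(micro_ops=1, reg_writes=1, scheduler="JFPU01")
  state = DispatchState(dispatch_width=2, group_size=0, free_rob_entries=64,
                        free_phys_regs=72, free_sched_entries={"JFPU01": 18})
  print(can_dispatch(vmulps, state))      # every resource is available

  state.free_sched_entries["JFPU01"] = 0  # the scheduler queue fills up
  print(can_dispatch(vmulps, state))      # dispatch stalls on the scheduler

The second call models the situation reported by the SCHEDQ counter: the
instruction cannot be dispatched because its scheduler queue is full.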
 935
 936 Scheduling models can optionally specify which register files are available on
 937 the processor. :program:`llvm-mca` uses that information to initialize register
 938 file descriptors.  Users can limit the number of physical registers that are
 939 globally available for register renaming by using the command option
 940 ``-register-file-size``.  A value of zero for this option means *unbounded*. By
 941 knowing how many registers are available for renaming, the tool can predict
 942 dispatch stalls caused by the lack of physical registers.
 943
The number of reorder buffer entries consumed by an instruction depends on the
number of micro-opcodes specified for that instruction by the target scheduling
model.  The reorder buffer is responsible for tracking the progress of
instructions that are "in-flight", and retiring them in program order.  The
number of entries in the reorder buffer defaults to the value specified by the
field ``MicroOpBufferSize`` in the target scheduling model.
 950
 951 Instructions that are dispatched to the schedulers consume scheduler buffer
 952 entries. :program:`llvm-mca` queries the scheduling model to determine the set
 953 of buffered resources consumed by an instruction.  Buffered resources are
 954 treated like scheduler resources.
 955
 956 Instruction Issue
 957 """""""""""""""""
Each processor scheduler implements a buffer of instructions.  An instruction
has to wait in the scheduler's buffer until input register operands become
available.  Only at that point does the instruction become eligible for
execution and may be issued (potentially out-of-order) for execution.
Instruction latencies are computed by :program:`llvm-mca` with the help of the
scheduling model.
 964
 965 :program:`llvm-mca`'s scheduler is designed to simulate multiple processor
 966 schedulers.  The scheduler is responsible for tracking data dependencies, and
 967 dynamically selecting which processor resources are consumed by instructions.
 968 It delegates the management of processor resource units and resource groups to a
 969 resource manager.  The resource manager is responsible for selecting resource
 970 units that are consumed by instructions.  For example, if an instruction
 971 consumes 1cy of a resource group, the resource manager selects one of the
 972 available units from the group; by default, the resource manager uses a
 973 round-robin selector to guarantee that resource usage is uniformly distributed
 974 between all units of a group.
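
As an illustration of round-robin selection within a resource group, consider
the sketch below. It is a simplified model, not the resource manager's actual
code: the ``RoundRobinGroup`` class and its ``select`` method are hypothetical,
and the unit names are used only as an example of a two-unit group.

.. code-block:: python

  class RoundRobinGroup:
      """Hands out the units of a resource group in rotating order."""

      def __init__(self, units):
          self.units = list(units)
          self.next_index = 0   # unit to try first on the next request

      def select(self, busy=()):
          """Return the next unit that is not busy, or None if all are busy."""
          count = len(self.units)
          for offset in range(count):
              index = (self.next_index + offset) % count
              unit = self.units[index]
              if unit not in busy:
                  # Start the next search after the unit just handed out.
                  self.next_index = (index + 1) % count
                  return unit
          return None

  group = RoundRobinGroup(["JFPU0", "JFPU1"])
  print([group.select() for _ in range(4)])   # alternates between the units
  print(group.select(busy={"JFPU0"}))         # skips the busy unit

Starting each search after the most recently selected unit is what spreads
usage evenly across the group when all units are free.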
 975
 976 :program:`llvm-mca`'s scheduler internally groups instructions into three sets:
 977
 978 * WaitSet: a set of instructions whose operands are not ready.
 979 * ReadySet: a set of instructions ready to execute.
 980 * IssuedSet: a set of instructions executing.
 981
Depending on operand availability, instructions that are dispatched to the
scheduler are either placed into the WaitSet or into the ReadySet.
 984
 985 Every cycle, the scheduler checks if instructions can be moved from the WaitSet
 986 to the ReadySet, and if instructions from the ReadySet can be issued to the
 987 underlying pipelines. The algorithm prioritizes older instructions over younger
 988 instructions.
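
The per-cycle movement between the three sets can be sketched as follows. This
is a deliberately simplified model that ignores latencies and processor
resources; the function ``scheduler_cycle``, the dictionary layout of an
instruction, and the ``issue_width`` parameter are assumptions made for this
example only. A lower ``id`` means an older instruction.

.. code-block:: python

  def scheduler_cycle(wait_set, ready_set, issued_set, ready_regs, issue_width):
      """Run one simplified scheduler cycle and return the ids issued."""
      # Promote instructions whose source operands are all available.
      for inst in sorted(wait_set, key=lambda i: i["id"]):
          if all(src in ready_regs for src in inst["srcs"]):
              wait_set.remove(inst)
              ready_set.append(inst)
      # Issue ready instructions, oldest first, up to the issue width.
      ready_set.sort(key=lambda i: i["id"])
      issued_now = ready_set[:issue_width]
      del ready_set[:issue_width]
      issued_set.extend(issued_now)
      return [inst["id"] for inst in issued_now]

  # The dot-product kernel: vmulps feeds the first vhaddps, which feeds
  # the second one.
  vmulps = {"id": 0, "srcs": ["xmm0", "xmm1"]}
  vhaddps1 = {"id": 1, "srcs": ["xmm2"]}
  vhaddps2 = {"id": 2, "srcs": ["xmm3"]}

  wait_set, ready_set, issued_set = [vhaddps1, vhaddps2], [vmulps], []
  ready_regs = {"xmm0", "xmm1"}

  first = scheduler_cycle(wait_set, ready_set, issued_set, ready_regs, 2)
  ready_regs.add("xmm2")   # pretend vmulps has written back its result
  second = scheduler_cycle(wait_set, ready_set, issued_set, ready_regs, 2)
  print(first, second)

Even with an issue width of two, only one instruction is issued per call here,
because each instruction depends on the result of the previous one.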
 989
 990 Write-Back and Retire Stage
 991 """""""""""""""""""""""""""
 992 Issued instructions are moved from the ReadySet to the IssuedSet.  There,
 993 instructions wait until they reach the write-back stage.  At that point, they
 994 get removed from the queue and the retire control unit is notified.
 995
 996 When instructions are executed, the retire control unit flags the instruction as
 997 "ready to retire."
 998
 999 Instructions are retired in program order.  The register file is notified of the
1000 retirement so that it can free the physical registers that were allocated for
1001 the instruction during the register renaming stage.
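
In-order retirement can be pictured as draining a queue from its head. The
sketch below is a toy model rather than the retire control unit's real code:
``retire_cycle``, the dictionary fields, and the ``max_retire`` limit are
hypothetical names chosen for this example.

.. code-block:: python

  from collections import deque

  def retire_cycle(rob, free_regs, max_retire):
      """Retire executed instructions from the head of the reorder buffer.

      Returns the ids retired this cycle and the updated count of free
      physical registers.
      """
      retired = []
      # Stop at the first instruction that has not finished executing, so
      # younger instructions never retire ahead of older ones.
      while rob and rob[0]["executed"] and len(retired) < max_retire:
          inst = rob.popleft()
          free_regs += inst["regs"]   # release registers allocated at rename
          retired.append(inst["id"])
      return retired, free_regs

  rob = deque([
      {"id": 0, "executed": True, "regs": 1},
      {"id": 1, "executed": False, "regs": 1},
      {"id": 2, "executed": True, "regs": 1},
  ])
  print(retire_cycle(rob, free_regs=69, max_retire=2))

Instruction 2 has finished executing, but it stays in the buffer because the
older instruction 1 has not.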
1002
1003 Load/Store Unit and Memory Consistency Model
1004 """"""""""""""""""""""""""""""""""""""""""""
1005 To simulate an out-of-order execution of memory operations, :program:`llvm-mca`
1006 utilizes a simulated load/store unit (LSUnit) to simulate the speculative
1007 execution of loads and stores.
1008
1009 Each load (or store) consumes an entry in the load (or store) queue. Users can
1010 specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
1011 load and store queues respectively. The queues are unbounded by default.
1012
1013 The LSUnit implements a relaxed consistency model for memory loads and stores.
1014 The rules are:
1015
1016 1. A younger load is allowed to pass an older load only if there are no
1017    intervening stores or barriers between the two loads.
1018 2. A younger load is allowed to pass an older store provided that the load does
1019    not alias with the store.
1020 3. A younger store is not allowed to pass an older store.
1021 4. A younger store is not allowed to pass an older load.
1022
By default, the LSUnit optimistically assumes that loads do not alias with
store operations (``-noalias=true``).  Under this assumption, younger loads are
always allowed to pass older stores.  Essentially, the LSUnit does not attempt
to run any alias analysis to predict when loads and stores do not alias with
each other.
1028
Note that, in the case of write-combining memory, rule 3 could be relaxed to
allow reordering of non-aliasing store operations.  That being said, at the
moment, there is no way to further relax the memory model (``-noalias`` is the
only option).  Essentially, there is no option to specify a different memory
type (e.g., write-back, write-combining, write-through, etc.) and consequently
to weaken, or strengthen, the memory model.
1035
1036 Other limitations are:
1037
1038 * The LSUnit does not know when store-to-load forwarding may occur.
1039 * The LSUnit does not know anything about cache hierarchy and memory types.
1040 * The LSUnit does not know how to identify serializing operations and memory
1041   fences.
1042
1043 The LSUnit does not attempt to predict if a load or store hits or misses the L1
1044 cache.  It only knows if an instruction "MayLoad" and/or "MayStore."  For
1045 loads, the scheduling model provides an "optimistic" load-to-use latency (which
1046 usually matches the load-to-use latency for when there is a hit in the L1D).
1047
:program:`llvm-mca` does not (on its own) know about serializing operations or
memory-barrier-like instructions.  The LSUnit used to conservatively use an
instruction's "MayLoad", "MayStore", and unmodeled side effects flags to
determine whether an instruction should be treated as a memory barrier. This was
inaccurate in general, so each instruction now has an ``IsAStoreBarrier`` and an
``IsALoadBarrier`` flag. These flags are specific to :program:`llvm-mca` and
default to false for every instruction. If an instruction should have either of
these flags set, this should be done within the target's ``InstrPostProcess``
class. For an example, look at the
``X86InstrPostProcess::postProcessInstruction`` method within
``llvm/lib/Target/X86/MCA/X86CustomBehaviour.cpp``.
1058
1059 A load/store barrier consumes one entry of the load/store queue.  A load/store
1060 barrier enforces ordering of loads/stores.  A younger load cannot pass a load
1061 barrier.  Also, a younger store cannot pass a store barrier.  A younger load
1062 has to wait for the memory/load barrier to execute.  A load/store barrier is
1063 "executed" when it becomes the oldest entry in the load/store queue(s). That
1064 also means, by construction, all of the older loads/stores have been executed.
1065
In conclusion, the full set of load/store consistency rules is:
1067
1068 #. A store may not pass a previous store.
1069 #. A store may not pass a previous load (regardless of ``-noalias``).
1070 #. A store has to wait until an older store barrier is fully executed.
1071 #. A load may pass a previous load.
1072 #. A load may not pass a previous store unless ``-noalias`` is set.
1073 #. A load has to wait until an older load barrier is fully executed.
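
The six rules above can be condensed into one predicate. The sketch below is
an illustration under simplifying assumptions, not the LSUnit's
implementation: ``MemOp`` and ``may_pass`` are hypothetical names, every
memory operation is modeled as either a load or a store, and barriers are
modeled as flags on such operations.

.. code-block:: python

  from dataclasses import dataclass

  @dataclass
  class MemOp:
      kind: str                       # "load" or "store"
      is_load_barrier: bool = False
      is_store_barrier: bool = False

  def may_pass(younger, older, noalias=True):
      """Return True if `younger` may execute before the older `older`."""
      if younger.kind == "store":
          # A store never passes an older store, an older load, or an
          # older store barrier. Every operation in this model is a load
          # or a store, so a store can never be reordered.
          return False
      # From here on, `younger` is a load.
      if older.is_load_barrier:
          return False      # wait until the load barrier has executed
      if older.kind == "store":
          return noalias    # allowed only under the no-alias assumption
      return True           # a load may pass a previous load

  load, store = MemOp("load"), MemOp("store")
  print(may_pass(load, load))                    # load passing a load
  print(may_pass(load, store))                   # default: -noalias=true
  print(may_pass(load, store, noalias=False))    # aliasing is possible
  print(may_pass(store, load))                   # stores stay in order

The ``noalias`` parameter defaults to ``True`` to mirror the tool's default
behavior described earlier in this section.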
1074
1075 In-order Issue and Execute
1076 """"""""""""""""""""""""""""""""""""
1077 In-order processors are modelled as a single ``InOrderIssueStage`` stage. It
1078 bypasses Dispatch, Scheduler and Load/Store unit. Instructions are issued as
1079 soon as their operand registers are available and resource requirements are
1080 met. Multiple instructions can be issued in one cycle according to the value of
1081 the ``IssueWidth`` parameter in LLVM's scheduling model.
1082
Once issued, an instruction is moved to the ``IssuedInst`` set until it is ready
to retire. :program:`llvm-mca` ensures that writes are committed in-order.
However, an instruction is allowed to commit writes and retire out-of-order if
the ``RetireOOO`` property is true for at least one of its writes.
1087
1088 Custom Behaviour
1089 """"""""""""""""""""""""""""""""""""
1090 Due to certain instructions not being expressed perfectly within their
1091 scheduling model, :program:`llvm-mca` isn't always able to simulate them
1092 perfectly. Modifying the scheduling model isn't always a viable
1093 option though (maybe because the instruction is modeled incorrectly on
1094 purpose or the instruction's behaviour is quite complex). The
1095 CustomBehaviour class can be used in these cases to enforce proper
1096 instruction modeling (often by customizing data dependencies and detecting
1097 hazards that :program:`llvm-mca` has no way of knowing about).
1098
1099 :program:`llvm-mca` comes with one generic and multiple target specific
1100 CustomBehaviour classes. The generic class will be used if the ``-disable-cb``
1101 flag is used or if a target specific CustomBehaviour class doesn't exist for
1102 that target. (The generic class does nothing.) Currently, the CustomBehaviour
1103 class is only a part of the in-order pipeline, but there are plans to add it
1104 to the out-of-order pipeline in the future.
1105
CustomBehaviour's main method is ``checkCustomHazard()``, which uses the
current instruction and a list of all instructions still executing within
the pipeline to determine if the current instruction should be dispatched.
As output, the method returns an integer representing the number of cycles
that the current instruction must stall for. A value of 0 represents no stall.
If the exact number is not known, the returned value can be an underestimate.
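
The contract of this method (take the in-flight instructions and the current
one, return a stall in cycles) can be modeled in a few lines. The sketch below
is a Python stand-in, not the C++ interface: ``check_custom_hazard``, the
dictionary fields, and the ``"mode"`` state are all hypothetical and exist only
to illustrate a hazard that the scheduling model does not describe.

.. code-block:: python

  def check_custom_hazard(issued, current):
      """Return how many cycles `current` must stall (0 means no stall)."""
      stall = 0
      for inst in issued:
          # Hypothetical hazard: `current` reads some state that an
          # in-flight instruction still has to write, and that dependency
          # is invisible to the scheduling model.
          if inst["hidden_writes"] & current["hidden_reads"]:
              stall = max(stall, inst["cycles_left"])
      return stall

  issued = [
      {"hidden_writes": {"mode"}, "cycles_left": 3},
      {"hidden_writes": set(), "cycles_left": 5},
  ]
  print(check_custom_hazard(issued, {"hidden_reads": {"mode"}}))
  print(check_custom_hazard(issued, {"hidden_reads": set()}))

Only the first in-flight instruction conflicts with the reader of ``"mode"``,
so the stall is taken from its remaining cycles. The second call reports no
stall.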
1112
If you'd like to add a CustomBehaviour class for a target that doesn't
already have one, refer to an existing implementation to see how to set it
up. The classes are implemented within the target specific backend (for
example ``/llvm/lib/Target/AMDGPU/MCA/``) so that they can access backend
symbols.
1117
1118 Instrument Manager
1119 """"""""""""""""""""""""""""""""""""
On certain architectures, the scheduling information for certain instructions
does not contain all of the information required to identify the most precise
schedule class. For example, data that can have an impact on scheduling can
be stored in CSR registers.
1124
One example of this is on RISCV, where values in registers such as ``vtype``
and ``vl`` change the scheduling behaviour of vector instructions. Since MCA
does not keep track of the values in registers, instrument comments can
be used to specify these values.
1129
InstrumentManager's main function is ``getSchedClassID()``, which has access
to the MCInst and all of the instruments that are active for that MCInst.
This function can use the instruments to override the schedule class of
the MCInst.
1134
On RISCV, instrument comments containing LMUL information are used
by ``getSchedClassID()`` to map a vector instruction and the active
LMUL to the scheduling class of the pseudo-instruction that describes
that base instruction and the active LMUL.
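
Conceptually, this mapping is a lookup keyed on the instruction and the active
LMUL, with a fallback to the instruction's own schedule class. The sketch below
is a Python stand-in for that idea only: the function name, the ``"LMUL"``
key, the opcode name, and every class ID are made up for illustration and do
not come from the RISCV backend.

.. code-block:: python

  # Hypothetical schedule class IDs; real IDs come from the scheduling model.
  DEFAULT_CLASS = {"VADD_VV": 100}
  LMUL_CLASS = {
      ("VADD_VV", "M1"): 101,
      ("VADD_VV", "M2"): 102,
  }

  def get_sched_class_id(opcode, instruments):
      """Pick a schedule class, preferring one that matches the active LMUL."""
      lmul = instruments.get("LMUL")   # None when no instrument is active
      # Fall back to the instruction's default class when there is no
      # LMUL-specific entry for this opcode.
      return LMUL_CLASS.get((opcode, lmul), DEFAULT_CLASS[opcode])

  print(get_sched_class_id("VADD_VV", {"LMUL": "M2"}))   # LMUL-specific class
  print(get_sched_class_id("VADD_VV", {}))               # default class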
1139
1140 Custom Views
1141 """"""""""""""""""""""""""""""""""""
1142 :program:`llvm-mca` comes with several Views such as the Timeline View and
1143 Summary View. These Views are generic and can work with most (if not all)
1144 targets. If you wish to add a new View to :program:`llvm-mca` and it does not
1145 require any backend functionality that is not already exposed through MC layer
1146 classes (MCSubtargetInfo, MCInstrInfo, etc.), please add it to the
1147 `/tools/llvm-mca/View/` directory. However, if your new View is target specific
1148 AND requires unexposed backend symbols or functionality, you can define it in
1149 the `/lib/Target/<TargetName>/MCA/` directory.
1150
To enable this target specific View, you will have to use this target's
CustomBehaviour class to override the ``CustomBehaviour::getViews()`` methods.
There are 3 variations of these methods based on where you want your View to
appear in the output: ``getStartViews()``, ``getPostInstrInfoViews()``, and
``getEndViews()``. These methods return a vector of Views, so you will want to
return a vector containing all of the target specific Views for the target in
question.
1158
Because these target specific (and backend dependent) Views require the
``CustomBehaviour::getViews()`` variants, these Views will not be enabled if
the ``-disable-cb`` flag is used.
1162
1163 Enabling these custom Views does not affect the non-custom (generic) Views.
1164 Continue to use the usual command line arguments to enable / disable those
1165 Views.