llvm/docs/CommandGuide/llvm-mca.rst

   1 llvm-mca - LLVM Machine Code Analyzer
   2 =====================================
   3
   4 .. program:: llvm-mca
   5
   6 SYNOPSIS
   7 --------
   8
   9 :program:`llvm-mca` [*options*] [input]
  10
  11 DESCRIPTION
  12 -----------
  13
  14 :program:`llvm-mca` is a performance analysis tool that uses information
  15 available in LLVM (e.g. scheduling models) to statically measure the performance
  16 of machine code on a specific CPU.
  17
  18 Performance is measured in terms of throughput as well as processor resource
  19 consumption. The tool currently works for processors with a backend for which
  20 there is a scheduling model available in LLVM.
  21
  22 The main goal of this tool is not just to predict the performance of the code
  23 when run on the target, but also to help diagnose potential performance
  24 issues.
  25
  26 Given an assembly code sequence, :program:`llvm-mca` estimates the Instructions
  27 Per Cycle (IPC), as well as hardware resource pressure. The analysis and
  28 reporting style were inspired by the IACA tool from Intel.
  29
  30 For example, you can compile code with clang, output assembly, and pipe it
  31 directly into :program:`llvm-mca` for analysis:
  32
  33 .. code-block:: bash
  34
  35   $ clang foo.c -O2 --target=x86_64 -S -o - | llvm-mca -mcpu=btver2
  36
  37 Or for Intel syntax:
  38
  39 .. code-block:: bash
  40
  41   $ clang foo.c -O2 --target=x86_64 -masm=intel -S -o - | llvm-mca -mcpu=btver2
  42
  43 (:program:`llvm-mca` detects Intel syntax by the presence of an `.intel_syntax`
  44 directive at the beginning of the input.  By default its output syntax matches
  45 that of its input.)
  46
  47 Scheduling models are not just used to compute instruction latencies and
  48 throughput, but also to understand what processor resources are available
  49 and how to simulate them.
  50
  51 By design, the quality of the analysis conducted by :program:`llvm-mca` is
  52 inevitably affected by the quality of the scheduling models in LLVM.
  53
  54 If you see that the performance report is not accurate for a processor,
  55 please `file a bug <https://github.com/llvm/llvm-project/issues>`_
  56 against the appropriate backend.
  57
  58 OPTIONS
  59 -------
  60
  61 If ``input`` is "``-``" or omitted, :program:`llvm-mca` reads from standard
  62 input. Otherwise, it will read from the specified filename.
  63
  64 If the :option:`-o` option is omitted, then :program:`llvm-mca` will send its output
  65 to standard output if the input is from standard input.  If the :option:`-o`
  66 option specifies "``-``", then the output will also be sent to standard output.
  67
  68
  69 .. option:: -help
  70
  71  Print a summary of command line options.
  72
  73 .. option:: -o <filename>
  74
  75  Use ``<filename>`` as the output filename. See the summary above for more
  76  details.
  77
  78 .. option:: -mtriple=<target triple>
  79
  80  Specify a target triple string.
  81
  82 .. option:: -march=<arch>
  83
  84  Specify the architecture for which to analyze the code. It defaults to the
  85  host default target.
  86
  87 .. option:: -mcpu=<cpuname>
  88
  89   Specify the processor for which to analyze the code.  By default, the cpu name
  90   is autodetected from the host.
  91
  92 .. option:: -output-asm-variant=<variant id>
  93
  94  Specify the output assembly variant for the report generated by the tool.
  95  On x86, possible values are 0 and 1. A value of 0 selects the AT&T assembly
  96  format, and a value of 1 selects the Intel assembly format, for the code
  97  printed out by the tool in the analysis report.
  98
  99 .. option:: -print-imm-hex
 100
 101  Prefer hex format for numeric literals in the output assembly printed as part
 102  of the report.
 103
 104 .. option:: -dispatch=<width>
 105
 106  Specify a different dispatch width for the processor. The dispatch width
 107  defaults to field 'IssueWidth' in the processor scheduling model.  If width is
 108  zero, then the default dispatch width is used.
 109
 110 .. option:: -register-file-size=<size>
 111
 112  Specify the size of the register file. When specified, this flag limits how
 113  many physical registers are available for register renaming purposes. A value
 114  of zero for this flag means "unlimited number of physical registers".
 115
 116 .. option:: -iterations=<number of iterations>
 117
 118  Specify the number of iterations to run. If this flag is set to 0, then the
 119  tool sets the number of iterations to a default value (i.e. 100).
 120
 121 .. option:: -noalias=<bool>
 122
 123   If set, the tool assumes that loads and stores don't alias. This is the
 124   default behavior.
 125
 126 .. option:: -lqueue=<load queue size>
 127
 128   Specify the size of the load queue in the load/store unit emulated by the tool.
 129   By default, the tool assumes an unbounded number of entries in the load queue.
 130   A value of zero for this flag is ignored, and the default load queue size is
 131   used instead.
 132
 133 .. option:: -squeue=<store queue size>
 134
 135   Specify the size of the store queue in the load/store unit emulated by the
 136   tool. By default, the tool assumes an unbounded number of entries in the store
 137   queue. A value of zero for this flag is ignored, and the default store queue
 138   size is used instead.
 139
 140 .. option:: -timeline
 141
 142   Enable the timeline view.
 143
 144 .. option:: -timeline-max-iterations=<iterations>
 145
 146   Limit the number of iterations to print in the timeline view. By default, the
 147   timeline view prints information for up to 10 iterations.
 148
 149 .. option:: -timeline-max-cycles=<cycles>
 150
 151   Limit the number of cycles in the timeline view, or use 0 for no limit. By
 152   default, the number of cycles is set to 80.
 153
 154 .. option:: -resource-pressure
 155
 156   Enable the resource pressure view. This is enabled by default.
 157
 158 .. option:: -register-file-stats
 159
 160   Enable register file usage statistics.
 161
 162 .. option:: -dispatch-stats
 163
 164   Enable extra dispatch statistics. This view collects and analyzes instruction
 165   dispatch events, as well as static/dynamic dispatch stall events. This view
 166   is disabled by default.
 167
 168 .. option:: -scheduler-stats
 169
 170   Enable extra scheduler statistics. This view collects and analyzes instruction
 171   issue events. This view is disabled by default.
 172
 173 .. option:: -retire-stats
 174
 175   Enable extra retire control unit statistics. This view is disabled by default.
 176
 177 .. option:: -instruction-info
 178
 179   Enable the instruction info view. This is enabled by default.
 180
 181 .. option:: -show-encoding
 182
 183   Enable the printing of instruction encodings within the instruction info view.
 184
 185 .. option:: -show-barriers
 186
 187   Enable the printing of LoadBarrier and StoreBarrier flags within the
 188   instruction info view.
 189
 190 .. option:: -all-stats
 191
 192   Print all hardware statistics. This enables extra statistics related to the
 193   dispatch logic, the hardware schedulers, the register file(s), and the retire
 194   control unit. This option is disabled by default.
 195
 196 .. option:: -all-views
 197
 198   Enable all the views.
 199
 200 .. option:: -instruction-tables
 201
 202   Prints resource pressure information based on the static information
 203   available from the processor model. This differs from the resource pressure
 204   view because it doesn't require that the code is simulated. It instead prints
 205   the theoretical uniform distribution of resource pressure for every
 206   instruction in sequence.
 207
 208 .. option:: -bottleneck-analysis
 209
 210   Print information about bottlenecks that affect the throughput. This analysis
 211   can be expensive, and it is disabled by default. Bottlenecks are highlighted
 212   in the summary view. Bottleneck analysis is currently not supported for
 213   processors with an in-order backend.
 214
 215 .. option:: -json
 216
 217   Print the requested views in valid JSON format. The instructions and the
 218   processor resources are printed as members of special top level JSON objects.
 219   The individual views refer to them by index. However, not all views are
 220   currently supported. For example, the report from the bottleneck analysis is
 221   not printed out in JSON. All the default views are currently supported.
 222
 223 .. option:: -disable-cb
 224
 225   Force usage of the generic CustomBehaviour and InstrPostProcess classes rather
 226   than using the target specific implementation. The generic classes never
 227   detect any custom hazards or make any post processing modifications to
 228   instructions.
 229
 230 .. option:: -disable-im
 231
 232   Force usage of the generic InstrumentManager rather than using the target
 233   specific implementation. The generic class creates Instruments that provide
 234   no extra information, and InstrumentManager never overrides the default
 235   schedule class for a given instruction.
 236
 237 .. option:: -skip-unsupported-instructions=<reason>
 238
 239   Force :program:`llvm-mca` to continue in the presence of instructions which do
 240   not parse or lack key scheduling information. Note that the resulting analysis
 241   is impacted, since those unsupported instructions are ignored as if they had
 242   not been supplied as part of the input.
 243
 244   The choice of `<reason>` controls when :program:`llvm-mca` will report an error.
 245   `<reason>` may be `none` (default), `lack-sched`, `parse-failure`, or `any`.
 246
 247 EXIT STATUS
 248 -----------
 249
 250 :program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
 251 to standard error, and the tool returns 1.
 252
 253 USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
 254 ---------------------------------------------
 255 :program:`llvm-mca` allows for the optional usage of special code comments to
 256 mark regions of the assembly code to be analyzed.  A comment starting with
 257 substring ``LLVM-MCA-BEGIN`` marks the beginning of an analysis region. A
 258 comment starting with substring ``LLVM-MCA-END`` marks the end of a region.
 259 For example:
 260
 261 .. code-block:: none
 262
 263   # LLVM-MCA-BEGIN
 264     ...
 265   # LLVM-MCA-END
 266
 267 If no user-defined region is specified, then :program:`llvm-mca` assumes a
 268 default region which contains every instruction in the input file.  Every region
 269 is analyzed in isolation, and the final performance report is the union of all
 270 the reports generated for every analysis region.
 271
 272 Analysis regions can have names. For example:
 273
 274 .. code-block:: none
 275
 276   # LLVM-MCA-BEGIN A simple example
 277     add %eax, %eax
 278   # LLVM-MCA-END
 279
 280 The code from the example above defines a region named "A simple example" with a
 281 single instruction in it. Note how the region name doesn't have to be repeated
 282 in the ``LLVM-MCA-END`` directive. In the absence of overlapping regions,
 283 an anonymous ``LLVM-MCA-END`` directive always ends the currently active
 284 user-defined region.
 285
 286 Example of nesting regions:
 287
 288 .. code-block:: none
 289
 290   # LLVM-MCA-BEGIN foo
 291     add %eax, %edx
 292   # LLVM-MCA-BEGIN bar
 293     sub %eax, %edx
 294   # LLVM-MCA-END bar
 295   # LLVM-MCA-END foo
 296
 297 Example of overlapping regions:
 298
 299 .. code-block:: none
 300
 301   # LLVM-MCA-BEGIN foo
 302     add %eax, %edx
 303   # LLVM-MCA-BEGIN bar
 304     sub %eax, %edx
 305   # LLVM-MCA-END foo
 306     add %eax, %edx
 307   # LLVM-MCA-END bar
 308
 309 Note that multiple anonymous regions cannot overlap. Also, overlapping regions
 310 cannot have the same name.
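
The marker rules above can be illustrated with a small script. This is only a
sketch of the rules as described in this section, not the parser used by
llvm-mca. The function name ``collect_regions`` and the ``#`` comment prefix
are assumptions made for the example, the default region is not modelled, and
regions still open at the end of the input are not reported.

```python
BEGIN, END = "LLVM-MCA-BEGIN", "LLVM-MCA-END"


def collect_regions(lines, comment_prefix="#"):
    """Group instructions by analysis region (hypothetical helper)."""
    active = {}    # region name -> instructions seen so far ("" = anonymous)
    finished = []  # (name, instructions), in the order regions were closed
    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        if not text.startswith(comment_prefix):
            # An instruction belongs to every region that is currently active.
            for instructions in active.values():
                instructions.append(text)
            continue
        comment = text[len(comment_prefix):].strip()
        if comment.startswith(BEGIN):
            name = comment[len(BEGIN):].strip()
            # Overlapping regions cannot share a name; this also rejects
            # two overlapping anonymous regions.
            if name in active:
                raise ValueError(f"region {name!r} is already active")
            active[name] = []
        elif comment.startswith(END):
            name = comment[len(END):].strip()
            # An anonymous END closes the only active region, if there is one.
            if name not in active and not name and len(active) == 1:
                name = next(iter(active))
            if name not in active:
                raise ValueError(f"no active region matches END {name!r}")
            finished.append((name, active.pop(name)))
    return finished


overlapping = [
    "# LLVM-MCA-BEGIN foo",
    "  add %eax, %edx",
    "# LLVM-MCA-BEGIN bar",
    "  sub %eax, %edx",
    "# LLVM-MCA-END foo",
    "  add %eax, %edx",
    "# LLVM-MCA-END bar",
]
for name, instructions in collect_regions(overlapping):
    print(name, instructions)
```

In the overlapping example, the ``sub`` instruction is attributed to both
``foo`` and ``bar``, because each region is analyzed in isolation.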
 311
 312 There is no support for marking regions from high-level source code, like C or
 313 C++. As a workaround, inline assembly directives may be used:
 314
 315 .. code-block:: c++
 316
 317   int foo(int a, int b) {
 318     __asm volatile("# LLVM-MCA-BEGIN foo":::"memory");
 319     a += 42;
 320     __asm volatile("# LLVM-MCA-END":::"memory");
 321     a *= b;
 322     return a;
 323   }
 324
 325 However, this interferes with optimizations like loop vectorization and may have
 326 an impact on the code generated. This is because the ``__asm`` statements are
 327 seen as real code having important side effects, which limits how the code
 328 around them can be transformed. If users want to make use of inline assembly
 329 to emit markers, then the recommendation is to always verify that the output
 330 assembly is equivalent to the assembly generated in the absence of markers.
 331 The `Clang options to emit optimization reports <https://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports>`_
 332 can also help in detecting missed optimizations.
 333
 334 INSTRUMENT REGIONS
 335 ------------------
 336
 337 An InstrumentRegion describes a region of assembly code guarded by
 338 special LLVM-MCA comment directives.
 339
 340 .. code-block:: none
 341
 342   # LLVM-MCA-<INSTRUMENT_TYPE> <data>
 343     ...  ## asm
 344
 345 where `INSTRUMENT_TYPE` is a type defined by the target and expects
 346 to use `data`.
 347
 348 A comment starting with substring `LLVM-MCA-<INSTRUMENT_TYPE>`
 349 brings data into scope for llvm-mca to use in its analysis for
 350 all following instructions.
 351
 352 If a comment with the same `INSTRUMENT_TYPE` is found later in the
 353 instruction list, then the original InstrumentRegion will be
 354 automatically ended, and a new InstrumentRegion will begin.
 355
 356 If there are comments with different `INSTRUMENT_TYPE` values,
 357 then both data sets remain available. In contrast with an AnalysisRegion,
 358 an InstrumentRegion does not need a comment to end the region.
 359
 360 Comments that are prefixed with `LLVM-MCA-` but do not correspond to
 361 a valid `INSTRUMENT_TYPE` for the target cause an error, except for
 362 `BEGIN` and `END`, since those correspond to AnalysisRegions. Comments
 363 that do not start with `LLVM-MCA-` are ignored by :program:`llvm-mca`.
 364
 365 An instruction (an MCInst) is added to an InstrumentRegion R only
 366 if its location is in range [R.RangeStart, R.RangeEnd].
 367
 368 On RISCV targets, vector instructions have different behaviour depending
 369 on the LMUL. Code can be instrumented with a comment that takes the
 370 following form:
 371
 372 .. code-block:: none
 373
 374   # LLVM-MCA-RISCV-LMUL <M1|M2|M4|M8|MF2|MF4|MF8>
 375
 376 The RISCV InstrumentManager will override the schedule class for vector
 377 instructions to use the scheduling behaviour of its pseudo-instruction
 378 which is LMUL dependent. It makes sense to place RISCV instrument
 379 comments directly after `vset{i}vl{i}` instructions, although
 380 they can be placed anywhere in the program.
 381
 382 Example of program with no call to `vset{i}vl{i}`:
 383
 384 .. code-block:: none
 385
 386   # LLVM-MCA-RISCV-LMUL M2
 387   vadd.vv v2, v2, v2
 388
 389 Example of program with call to `vset{i}vl{i}`:
 390
 391 .. code-block:: none
 392
 393   vsetvli zero, a0, e8, m1, tu, mu
 394   # LLVM-MCA-RISCV-LMUL M1
 395   vadd.vv v2, v2, v2
 396
 397 Example of program with multiple calls to `vset{i}vl{i}`:
 398
 399 .. code-block:: none
 400
 401   vsetvli zero, a0, e8, m1, tu, mu
 402   # LLVM-MCA-RISCV-LMUL M1
 403   vadd.vv v2, v2, v2
 404   vsetvli zero, a0, e8, m8, tu, mu
 405   # LLVM-MCA-RISCV-LMUL M8
 406   vadd.vv v2, v2, v2
 407
 408 Example of program with call to `vsetvl`:
 409
 410 .. code-block:: none
 411
 412  vsetvl rd, rs1, rs2
 413  # LLVM-MCA-RISCV-LMUL M1
 414  vadd.vv v12, v12, v12
 415  vsetvl rd, rs1, rs2
 416  # LLVM-MCA-RISCV-LMUL M4
 417  vadd.vv v12, v12, v12
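
The scoping rule for instrument comments can be sketched as follows. The
script only shows which LMUL value is in scope for each instruction; it does
not reproduce what the RISCV InstrumentManager does with that value. The
helper name ``lmul_in_effect`` is hypothetical, and rejecting unknown LMUL
values with an exception is an assumption of this sketch.

```python
LMUL_DIRECTIVE = "LLVM-MCA-RISCV-LMUL"
VALID_LMUL = {"M1", "M2", "M4", "M8", "MF2", "MF4", "MF8"}


def lmul_in_effect(lines):
    """Pair every instruction with the LMUL data in scope (hypothetical helper)."""
    current = None  # no instrument comment seen yet
    result = []
    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        if text.startswith("#"):
            comment = text.lstrip("#").strip()
            if comment.startswith(LMUL_DIRECTIVE):
                value = comment[len(LMUL_DIRECTIVE):].strip()
                if value not in VALID_LMUL:  # assumption: treat as an error
                    raise ValueError(f"unexpected LMUL value {value!r}")
                # A new comment of the same type ends the previous region.
                current = value
            continue
        result.append((text, current))
    return result


program = [
    "vsetvli zero, a0, e8, m1, tu, mu",
    "# LLVM-MCA-RISCV-LMUL M1",
    "vadd.vv v2, v2, v2",
    "vsetvli zero, a0, e8, m8, tu, mu",
    "# LLVM-MCA-RISCV-LMUL M8",
    "vadd.vv v2, v2, v2",
]
for instruction, lmul in lmul_in_effect(program):
    print(f"{lmul!s:5} {instruction}")
```

Note that the second ``vsetvli`` is still covered by ``M1``, since the ``M8``
comment only applies to the instructions that follow it.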
 418
 419 HOW LLVM-MCA WORKS
 420 ------------------
 421
 422 :program:`llvm-mca` takes assembly code as input. The assembly code is parsed
 423 into a sequence of MCInst with the help of the existing LLVM target assembly
 424 parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
 425 to generate a performance report.
 426
 427 The Pipeline module simulates the execution of the machine code sequence in a
 428 loop of iterations (default is 100). During this process, the pipeline collects
 429 a number of execution related statistics. At the end of this process, the
 430 pipeline generates and prints a report from the collected statistics.
 431
 432 Here is an example of a performance report generated by the tool for a
 433 dot-product of two packed float vectors of four elements. The analysis is
 434 conducted for target x86, cpu btver2.  The result below can be produced with
 435 the following command, using the example located at
 436 ``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:
 437
 438 .. code-block:: bash
 439
 440   $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s
 441
 442 .. code-block:: none
 443
 444   Iterations:        300
 445   Instructions:      900
 446   Total Cycles:      610
 447   Total uOps:        900
 448
 449   Dispatch Width:    2
 450   uOps Per Cycle:    1.48
 451   IPC:               1.48
 452   Block RThroughput: 2.0
 453
 454
 455   Instruction Info:
 456   [1]: #uOps
 457   [2]: Latency
 458   [3]: RThroughput
 459   [4]: MayLoad
 460   [5]: MayStore
 461   [6]: HasSideEffects (U)
 462
 463   [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 464    1      2     1.00                        vmulps      %xmm0, %xmm1, %xmm2
 465    1      3     1.00                        vhaddps     %xmm2, %xmm2, %xmm3
 466    1      3     1.00                        vhaddps     %xmm3, %xmm3, %xmm4
 467
 468
 469   Resources:
 470   [0]   - JALU0
 471   [1]   - JALU1
 472   [2]   - JDiv
 473   [3]   - JFPA
 474   [4]   - JFPM
 475   [5]   - JFPU0
 476   [6]   - JFPU1
 477   [7]   - JLAGU
 478   [8]   - JMul
 479   [9]   - JSAGU
 480   [10]  - JSTC
 481   [11]  - JVALU0
 482   [12]  - JVALU1
 483   [13]  - JVIMUL
 484
 485
 486   Resource pressure per iteration:
 487   [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
 488    -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -
 489
 490   Resource pressure by instruction:
 491   [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
 492    -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps      %xmm0, %xmm1, %xmm2
 493    -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm2, %xmm2, %xmm3
 494    -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm3, %xmm3, %xmm4
 495
 496 According to this report, the dot-product kernel has been executed 300 times,
 497 for a total of 900 simulated instructions. The total number of simulated micro
 498 opcodes (uOps) is also 900.
 499
 500 The report is structured in three main sections.  The first section collects a
 501 few performance numbers; the goal of this section is to give a very quick
 502 overview of the performance throughput. Important performance indicators are
 503 **IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
 504 Throughput).
 505
 506 Field *DispatchWidth* is the maximum number of micro opcodes that are dispatched
 507 to the out-of-order backend every simulated cycle. For processors with an
 508 in-order backend, *DispatchWidth* is the maximum number of micro opcodes issued
 509 to the backend every simulated cycle.
 510
 511 IPC is computed by dividing the total number of simulated instructions by the total
 512 number of cycles.
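
As a quick check against the summary shown earlier, the IPC can be recomputed
from the ``Instructions`` and ``Total Cycles`` fields. The function name
``ipc`` is only illustrative and is not part of llvm-mca.

```python
def ipc(instructions: int, total_cycles: int) -> float:
    """Simulated instructions divided by simulated cycles."""
    return instructions / total_cycles


# Values taken from the dot-product report: 900 instructions, 610 cycles.
print(f"IPC: {ipc(900, 610):.2f}")  # prints "IPC: 1.48"
```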
 513
 514 Field *Block RThroughput* is the reciprocal of the block throughput. Block
 515 throughput is a theoretical quantity computed as the maximum number of blocks
 516 (i.e. iterations) that can be executed per simulated clock cycle in the absence
 517 of loop-carried dependencies. Block throughput is bounded from above by the
 518 dispatch rate and by the availability of hardware resources.
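
The two limits can be made concrete with a simplified sketch. It assumes that
the reciprocal block throughput is the larger of a dispatch bound (uOps per
iteration divided by the dispatch width) and a resource bound (the highest
per-iteration resource pressure, divided by the number of units of that
resource). The function name and the one-unit-per-resource default are
assumptions of this example, not a description of the tool's internals.

```python
def block_rthroughput(uops_per_iteration, dispatch_width, resource_cycles, units=None):
    """Estimate the reciprocal block throughput, in cycles per iteration."""
    units = units or {}  # resource name -> number of units (default: 1)
    dispatch_bound = uops_per_iteration / dispatch_width
    resource_bound = max(
        (cycles / units.get(name, 1) for name, cycles in resource_cycles.items()),
        default=0.0,
    )
    return max(dispatch_bound, resource_bound)


# Per-iteration resource pressure from the dot-product report.
pressure = {"JFPA": 2.0, "JFPM": 1.0, "JFPU0": 2.0, "JFPU1": 1.0}
print(block_rthroughput(3, 2, pressure))  # prints 2.0
```

Here the dispatch bound is 1.5 cycles (3 uOps at 2 uOps per cycle), while
JFPA and JFPU0 each need 2 cycles per iteration, so the resource bound wins
and the result agrees with the ``Block RThroughput`` of 2.0 in the report.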
 519
 520 In the absence of loop-carried data dependencies, the observed IPC tends to a
 521 theoretical maximum which can be computed by dividing the number of instructions
 522 of a single iteration by the `Block RThroughput`.
 523
 524 Field 'uOps Per Cycle' is computed by dividing the total number of simulated micro
 525 opcodes by the total number of cycles. A delta between Dispatch Width and this
 526 field is an indicator of a performance issue. In the absence of loop-carried
 527 data dependencies, the observed 'uOps Per Cycle' should tend to a theoretical
 528 maximum throughput which can be computed by dividing the number of uOps of a
 529 single iteration by the `Block RThroughput`.
 530
 531 Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
 532 because the dispatch width limits the maximum size of a dispatch group. Both IPC
 533 and 'uOps Per Cycle' are limited by the amount of hardware parallelism. The
 534 availability of hardware resources affects the resource pressure distribution,
 535 and it limits the number of instructions that can be executed in parallel every
 536 cycle.  A delta between Dispatch Width and the theoretical maximum uOps per
 537 Cycle (computed by dividing the number of uOps of a single iteration by the
 538 `Block RThroughput`) is an indicator of a performance bottleneck caused by the
 539 lack of hardware resources.
 540 In general, the lower the Block RThroughput, the better.
 541
 542 In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
 543 are no loop-carried dependencies, the observed `uOps Per Cycle` is expected to
 544 approach 1.50 when the number of iterations tends to infinity. The delta between
 545 the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
 546 an indicator of a performance bottleneck caused by the lack of hardware
 547 resources, and the *Resource pressure view* can help to identify the problematic
 548 resource usage.
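
The arithmetic behind these numbers is reproduced below as a worked example.
The helper name ``max_uops_per_cycle`` is hypothetical; the inputs come from
the dot-product report.

```python
def max_uops_per_cycle(uops_per_iteration: int, block_rthroughput: float) -> float:
    """Theoretical uOp throughput in the absence of loop-carried dependencies."""
    return uops_per_iteration / block_rthroughput


dispatch_width = 2
limit = max_uops_per_cycle(3, 2.0)   # 3 uOps per iteration, Block RThroughput 2.0
observed = 900 / 610                 # Total uOps / Total Cycles after 300 iterations

print(f"theoretical maximum:      {limit:.2f}")                    # 1.50
print(f"observed after 300 iters: {observed:.2f}")                 # 1.48
print(f"gap to dispatch width:    {dispatch_width - limit:.2f}")   # 0.50
```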
 549
 550 The second section of the report is the `instruction info view`. It shows the
 551 latency and reciprocal throughput of every instruction in the sequence. It also
 552 reports extra information related to the number of micro opcodes, and opcode
 553 properties (i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').
 554
 555 Field *RThroughput* is the reciprocal of the instruction throughput. Throughput
 556 is computed as the maximum number of instructions of the same type that can be
 557 executed per clock cycle in the absence of operand dependencies. In this
 558 example, the reciprocal throughput of a vector float multiply is 1
 559 cycle/instruction.  That is because the FP multiplier JFPM is only available
 560 from pipeline JFPU1.
 561
 562 Instruction encodings are displayed within the instruction info view when flag
 563 `-show-encoding` is specified.
 564
 565 Below is an example of `-show-encoding` output for the dot-product kernel:
 566
 567 .. code-block:: none
 568
 569   Instruction Info:
 570   [1]: #uOps
 571   [2]: Latency
 572   [3]: RThroughput
 573   [4]: MayLoad
 574   [5]: MayStore
 575   [6]: HasSideEffects (U)
 576   [7]: Encoding Size
 577
 578   [1]    [2]    [3]    [4]    [5]    [6]    [7]    Encodings:                    Instructions:
 579    1      2     1.00                         4     c5 f0 59 d0                   vmulps %xmm0, %xmm1, %xmm2
 580    1      4     1.00                         4     c5 eb 7c da                   vhaddps        %xmm2, %xmm2, %xmm3
 581    1      4     1.00                         4     c5 e3 7c e3                   vhaddps        %xmm3, %xmm3, %xmm4
 582
 583 The `Encoding Size` column shows the size in bytes of instructions.  The
 584 `Encodings` column shows the actual instruction encodings (byte sequences in
 585 hex).
 586
 587 The third section is the *Resource pressure view*.  This view reports
 588 the average number of resource cycles consumed every iteration by instructions
 589 for every processor resource unit available on the target.  Information is
 590 structured in two tables. The first table reports the number of resource cycles
 591 spent on average every iteration. The second table correlates the resource
 592 cycles to the machine instruction in the sequence. For example, every iteration
 593 of the instruction vmulps always executes on resource unit [6]
 594 (JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
 595 per iteration.  Note that on AMD Jaguar, vector floating-point multiply can
 596 only be issued to pipeline JFPU1, while horizontal floating-point additions can
 597 only be issued to pipeline JFPU0.
 598
 599 The resource pressure view helps with identifying bottlenecks caused by high
 600 usage of specific hardware resources.  Situations with resource pressure mainly
 601 concentrated on a few resources should, in general, be avoided.  Ideally,
 602 pressure should be uniformly distributed between multiple resources.
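
The relationship between the two tables can be checked by hand: summing the
rows of the second table gives the first. The sketch below does this for the
dot-product report and picks out the most heavily used resources. The data is
transcribed from the report; the function name is illustrative.

```python
from collections import defaultdict

# "Resource pressure by instruction", with the empty ("-") cells left out.
by_instruction = [
    ("vmulps", {"JFPM": 1.0, "JFPU1": 1.0}),
    ("vhaddps", {"JFPA": 1.0, "JFPU0": 1.0}),
    ("vhaddps", {"JFPA": 1.0, "JFPU0": 1.0}),
]


def pressure_per_iteration(rows):
    """Sum resource cycles over all instructions of one iteration."""
    totals = defaultdict(float)
    for _, usage in rows:
        for resource, cycles in usage.items():
            totals[resource] += cycles
    return dict(totals)


totals = pressure_per_iteration(by_instruction)
busiest = max(totals.values())
hot = sorted(name for name, cycles in totals.items() if cycles == busiest)
print(totals)
print("highest pressure:", hot)
```

For this kernel, JFPA and JFPU0 carry 2.00 resource cycles per iteration each,
twice as much as JFPM and JFPU1.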
 603
 604 Timeline View
 605 ^^^^^^^^^^^^^
 606 The timeline view produces a detailed report of each instruction's state
 607 transitions through an instruction pipeline.  This view is enabled by the
 608 command line option ``-timeline``.  As instructions transition through the
 609 various stages of the pipeline, their states are depicted in the view report.
 610 These states are represented by the following characters:
 611
 612 * D : Instruction dispatched.
 613 * e : Instruction executing.
 614 * E : Instruction executed.
 615 * R : Instruction retired.
 616 * = : Instruction already dispatched, waiting to be executed.
 617 * \- : Instruction executed, waiting to be retired.
 618
 619 Below is the timeline view for a subset of the dot-product example located in
 620 ``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
 621 :program:`llvm-mca` using the following command:
 622
 623 .. code-block:: bash
 624
 625   $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s
 626
 627 .. code-block:: none
 628
 629   Timeline view:
 630                       012345
 631   Index     0123456789
 632
 633   [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
 634   [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
 635   [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
 636   [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
 637   [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
 638   [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
 639   [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
 640   [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
 641   [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4
 642
 643
 644   Average Wait times (based on the timeline view):
 645   [0]: Executions
 646   [1]: Average time spent waiting in a scheduler's queue
 647   [2]: Average time spent waiting in a scheduler's queue while ready
 648   [3]: Average time elapsed from WB until retire stage
 649
 650         [0]    [1]    [2]    [3]
 651   0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
 652   1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
 653   2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4
 654          3     3.3    0.5    1.4       <total>
 655
 656 The timeline view is interesting because it shows instruction state changes
 657 during execution.  It also gives an idea of how the tool processes instructions
 658 executed on the target, and how their timing information might be calculated.
 659
 660 The timeline view is structured in two tables.  The first table shows
 661 instructions changing state over time (measured in cycles); the second table
 662 (named *Average Wait times*) reports useful timing statistics, which should
 663 help diagnose performance bottlenecks caused by long data dependencies and
 664 sub-optimal usage of hardware resources.
 665
 666 An instruction in the timeline view is identified by a pair of indices, where
 667 the first index identifies an iteration, and the second index is the
 668 instruction index (i.e., where it appears in the code sequence).  Since this
 669 example was generated using 3 iterations (``-iterations=3``), the iteration
 670 indices range from 0 to 2 inclusive.
 671
 672 Excluding the first and last columns, the remaining columns are in cycles.
 673 Cycles are numbered sequentially starting from 0.
 674
 675 From the example output above, we know the following:
 676
 677 * Instruction [1,0] was dispatched at cycle 1.
 678 * Instruction [1,0] started executing at cycle 2.
 679 * Instruction [1,0] reached the write back stage at cycle 4.
 680 * Instruction [1,0] was retired at cycle 10.
 681
 682 Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
 683 scheduler's queue for the operands to become available. By the time vmulps is
 684 dispatched, operands are already available, and pipeline JFPU1 is ready to
 685 serve another instruction.  So the instruction can be immediately issued on the
 686 JFPU1 pipeline. That is demonstrated by the fact that the instruction only
 687 spent 1cy in the scheduler's queue.
 688
 689 There is a gap of 5 cycles between the write-back stage and the retire event.
 690 That is because instructions must retire in program order, so [1,0] has to wait
 691 for [0,2] to be retired first (i.e., it has to wait until cycle 10).
 692
 693 In the example, all instructions are in a RAW (Read After Write) dependency
 694 chain.  Register %xmm2 written by vmulps is immediately used by the first
 695 vhaddps, and register %xmm3 written by the first vhaddps is used by the second
 696 vhaddps.  Long data dependencies negatively impact the ILP (Instruction Level
 697 Parallelism).
 698
 699 In the dot-product example, there are anti-dependencies introduced by
 700 instructions from different iterations.  However, those dependencies can be
 701 removed at register renaming stage (at the cost of allocating register aliases,
 702 and therefore consuming physical registers).
 703
 704 Table *Average Wait times* helps diagnose performance issues that are caused by
 705 the presence of long latency instructions and potentially long data dependencies
 706 which may limit the ILP. The last row, ``<total>``, shows a global average over all
 707 instructions measured. Note that :program:`llvm-mca`, by default, assumes at
 708 least 1cy between the dispatch event and the issue event.
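
Columns [1] and [3] of the table can be re-derived from the timeline rows. The
sketch below assumes that the time spent in the scheduler's queue is the
number of ``=`` cycles plus the one cycle between dispatch and issue, and that
the time from write-back to retirement is the number of ``-`` cycles. These
assumptions reproduce the values printed above, but they are an interpretation
of the output and not a description of the tool's implementation. Column [2]
cannot be derived from the timeline characters alone and is left out.

```python
# Cycle columns of the timeline above, keyed by (iteration, instruction index).
timeline = {
    (0, 0): "DeeER",
    (0, 1): "D==eeeER",
    (0, 2): ".D====eeeER",
    (1, 0): ".DeeE-----R",
    (1, 1): ". D=eeeE---R",
    (1, 2): ". D====eeeER",
    (2, 0): ".  DeeE-----R",
    (2, 1): ".  D====eeeER",
    (2, 2): ".   D======eeeER",
}


def average_wait_times(rows, instruction_index):
    """Return (executions, avg. cycles in queue, avg. cycles from WB to retire)."""
    cells = [c for (_, index), c in rows.items() if index == instruction_index]
    executions = len(cells)
    in_queue = sum(c.count("=") + 1 for c in cells) / executions
    until_retire = sum(c.count("-") for c in cells) / executions
    return executions, in_queue, until_retire


for index in range(3):
    executions, in_queue, until_retire = average_wait_times(timeline, index)
    print(f"{index}.  {executions}  {in_queue:.1f}  {until_retire:.1f}")
```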
 709
 710 When the performance is limited by data dependencies and/or long latency
 711 instructions, the number of cycles spent while in the *ready* state is expected
 712 to be very small when compared with the total number of cycles spent in the
 713 scheduler's queue.  The difference between the two counters is a good indicator
 714 of how large an impact data dependencies had on the execution of the
 715 instructions.  When performance is mostly limited by the lack of hardware
 716 resources, the delta between the two counters is small.  However, the number of
 717 cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
 718 especially when compared to other low latency instructions.
 719
 720 Bottleneck Analysis
 721 ^^^^^^^^^^^^^^^^^^^
 722 The ``-bottleneck-analysis`` command line option enables the analysis of
 723 performance bottlenecks.
 724
 725 This analysis is potentially expensive. It attempts to correlate increases in
 726 backend pressure (caused by pipeline resource pressure and data dependencies) to
 727 dynamic dispatch stalls.
 728
 729 Below is an example of ``-bottleneck-analysis`` output generated by
 730 :program:`llvm-mca` for 500 iterations of the dot-product example on btver2.
 731
 732 .. code-block:: none
 733
 734
 735   Cycles with backend pressure increase [ 48.07% ]
 736   Throughput Bottlenecks:
 737     Resource Pressure       [ 47.77% ]
 738     - JFPA  [ 47.77% ]
 739     - JFPU0  [ 47.77% ]
 740     Data Dependencies:      [ 0.30% ]
 741     - Register Dependencies [ 0.30% ]
 742     - Memory Dependencies   [ 0.00% ]
 743
 744   Critical sequence based on the simulation:
 745
 746                 Instruction                         Dependency Information
 747    +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
 748    |
 749    |    < loop carried >
 750    |
 751    |      0.    vmulps  %xmm0, %xmm1, %xmm2
 752    +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
 753    +----> 2.    vhaddps %xmm3, %xmm3, %xmm4         ## REGISTER dependency:  %xmm3
 754    |
 755    |    < loop carried >
 756    |
 757    +----> 1.    vhaddps %xmm2, %xmm2, %xmm3         ## RESOURCE interference:  JFPA [ probability: 74% ]
 758
 759
 760 According to the analysis, throughput is limited by resource pressure and not by
 761 data dependencies.  The analysis observed increases in backend pressure during
 762 48.07% of the simulated run. Almost all those pressure increase events were
 763 caused by contention on processor resources JFPA/JFPU0.
 764
 765 The `critical sequence` is the most expensive sequence of instructions according
 766 to the simulation. It is annotated to provide extra information about critical
 767 register dependencies and resource interferences between instructions.
 768
 769 Instructions from the critical sequence are expected to significantly impact
 770 performance. By construction, the accuracy of this analysis is strongly
 771 dependent on the simulation and (as always) on the quality of the processor
 772 model in LLVM.
 773
 774 Bottleneck analysis is currently not supported for processors with an in-order
 775 backend.
 776
 777 Extra Statistics to Further Diagnose Performance Issues
 778 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 779 The ``-all-stats`` command line option enables extra statistics and performance
 780 counters for the dispatch logic, the reorder buffer, the retire control unit,
 781 and the register file.
 782
 783 Below is an example of ``-all-stats`` output generated by  :program:`llvm-mca`
 784 for 300 iterations of the dot-product example discussed in the previous
 785 sections.
 786
 787 .. code-block:: none
 788
 789   Dynamic Dispatch Stall Cycles:
 790   RAT     - Register unavailable:                      0
 791   RCU     - Retire tokens unavailable:                 0
 792   SCHEDQ  - Scheduler full:                            272  (44.6%)
 793   LQ      - Load queue full:                           0
 794   SQ      - Store queue full:                          0
 795   GROUP   - Static restrictions on the dispatch group: 0
 796
 797
 798   Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
 799   [# dispatched], [# cycles]
 800    0,              24  (3.9%)
 801    1,              272  (44.6%)
 802    2,              314  (51.5%)
 803
 804
 805   Schedulers - number of cycles where we saw N micro opcodes issued:
 806   [# issued], [# cycles]
 807    0,          7  (1.1%)
 808    1,          306  (50.2%)
 809    2,          297  (48.7%)
 810
 811   Scheduler's queue usage:
 812   [1] Resource name.
 813   [2] Average number of used buffer entries.
 814   [3] Maximum number of used buffer entries.
 815   [4] Total number of buffer entries.
 816
 817    [1]            [2]        [3]        [4]
 818   JALU01           0          0          20
 819   JFPU01           17         18         18
 820   JLSAGU           0          0          12
 821
 822
 823   Retire Control Unit - number of cycles where we saw N instructions retired:
 824   [# retired], [# cycles]
 825    0,           109  (17.9%)
 826    1,           102  (16.7%)
 827    2,           399  (65.4%)
 828
 829   Total ROB Entries:                64
 830   Max Used ROB Entries:             35  ( 54.7% )
 831   Average Used ROB Entries per cy:  32  ( 50.0% )
 832
 833
 834   Register File statistics:
 835   Total number of mappings created:    900
 836   Max number of mappings used:         35
 837
 838   *  Register File #1 -- JFpuPRF:
 839      Number of physical registers:     72
 840      Total number of mappings created: 900
 841      Max number of mappings used:      35
 842
 843   *  Register File #2 -- JIntegerPRF:
 844      Number of physical registers:     64
 845      Total number of mappings created: 0
 846      Max number of mappings used:      0
 847
 848 If we look at the *Dynamic Dispatch Stall Cycles* table, we see the counter for
 849 SCHEDQ reports 272 cycles.  This counter is incremented every time the dispatch
 850 logic is unable to dispatch a full group because the scheduler's queue is full.
 851
 852 Looking at the *Dispatch Logic* table, we see that the pipeline was only able to
 853 dispatch two micro opcodes 51.5% of the time.  The dispatch group was limited to
 854 one micro opcode 44.6% of the cycles, which corresponds to 272 cycles.  The
 855 dispatch statistics are displayed by using either the command option
 856 ``-all-stats`` or ``-dispatch-stats``.
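
The percentages in the *Dispatch Logic* table can be recomputed from the raw
cycle counts.  The following Python sketch is only an illustration (it is not
part of :program:`llvm-mca`, and the variable names are made up); the counts are
copied from the table above.

.. code-block:: python

  # key = micro opcodes dispatched in a cycle, value = number of cycles
  dispatch_histogram = {0: 24, 1: 272, 2: 314}

  total_cycles = sum(dispatch_histogram.values())
  total_uops = sum(n * cycles for n, cycles in dispatch_histogram.items())
  percentages = {
      n: round(100.0 * cycles / total_cycles, 1)
      for n, cycles in dispatch_histogram.items()
  }

  print(total_cycles)  # 610
  print(total_uops)    # 900
  print(percentages)   # {0: 3.9, 1: 44.6, 2: 51.5}

The total of 900 dispatched micro opcodes equals the 900 instructions processed
in this run (three instructions for each of the 300 iterations).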
 857
 858 The next table, *Schedulers*, presents a histogram of the number of cycles in
 859 which a given number of micro opcodes were issued.  In this case, of the 610
 860 simulated cycles, a single micro opcode was issued in 306 cycles (50.2%) and
 861 there were 7 cycles where no micro opcodes were issued.
 862
 863 The *Scheduler's queue usage* table shows the average and maximum number of
 864 buffer entries (i.e., scheduler queue entries) used at runtime.  Resource JFPU01
 865 reached its maximum (18 of 18 queue entries). Note that AMD Jaguar implements
 866 three schedulers:
 867
 868 * JALU01 - A scheduler for ALU instructions.
 869 * JFPU01 - A scheduler for floating point operations.
 870 * JLSAGU - A scheduler for address generation.
 871
 872 The dot-product is a kernel of three floating point instructions (a vector
 873 multiply followed by two horizontal adds).  That explains why only the floating
 874 point scheduler appears to be used.
 875
 876 A full scheduler queue is either caused by data dependency chains or by a
 877 sub-optimal usage of hardware resources.  Sometimes, resource pressure can be
 878 mitigated by rewriting the kernel using different instructions that consume
 879 different scheduler resources.  Schedulers with a small queue are less resilient
 880 to bottlenecks caused by the presence of long data dependencies.  The scheduler
 881 statistics are displayed by using the command option ``-all-stats`` or
 882 ``-scheduler-stats``.
 883
 884 The next table, *Retire Control Unit*, presents a histogram of the number of
 885 cycles in which a given number of instructions were retired.  In this case, of
 886 the 610 simulated cycles, two instructions were retired during the same cycle
 887 399 times (65.4%) and there were 109 cycles where no instructions were
 888 retired.  The retire statistics are displayed by using the command option
 889 ``-all-stats`` or ``-retire-stats``.
 890
 891 The last table presented is *Register File statistics*.  Each physical register
 892 file (PRF) used by the pipeline is presented in this table.  In the case of AMD
 893 Jaguar, there are two register files, one for floating-point registers (JFpuPRF)
 894 and one for integer registers (JIntegerPRF).  The table shows that of the 900
 895 instructions processed, there were 900 mappings created.  Since this dot-product
 896 example utilized only floating point registers, the JFpuPRF was responsible for
 897 creating the 900 mappings.  However, we see that the pipeline only used a
 898 maximum of 35 of 72 available register slots at any given time. We can conclude
 899 that the floating point PRF was the only register file used for the example, and
 900 that it was never resource constrained.  The register file statistics are
 901 displayed by using the command option ``-all-stats`` or
 902 ``-register-file-stats``.
 903
 904 In this example, we can conclude that the IPC is mostly limited by data
 905 dependencies, and not by resource pressure.
 906
 907 Instruction Flow
 908 ^^^^^^^^^^^^^^^^
 909 This section describes the instruction flow through the default pipeline of
 910 :program:`llvm-mca`, as well as the functional units involved in the process.
 911
 912 The default pipeline implements the following sequence of stages used to
 913 process instructions.
 914
 915 * Dispatch (Instruction is dispatched to the schedulers).
 916 * Issue (Instruction is issued to the processor pipelines).
 917 * Write Back (Instruction is executed, and results are written back).
 918 * Retire (Instruction is retired; writes are architecturally committed).
 919
 920 The in-order pipeline implements the following sequence of stages:
 921
 922 * InOrderIssue (Instruction is issued to the processor pipelines).
 923 * Retire (Instruction is retired; writes are architecturally committed).
 924
 925 :program:`llvm-mca` assumes that instructions have all been decoded and placed
 926 into a queue before the simulation starts. Therefore, the instruction fetch and
 927 decode stages are not modeled. Performance bottlenecks in the frontend are not
 928 diagnosed. Also, :program:`llvm-mca` does not model branch prediction.
 929
 930 Instruction Dispatch
 931 """"""""""""""""""""
 932 During the dispatch stage, instructions are picked in program order from a
 933 queue of already decoded instructions, and dispatched in groups to the
 934 simulated hardware schedulers.
 935
 936 The size of a dispatch group depends on the availability of the simulated
 937 hardware resources.  The processor dispatch width defaults to the value
 938 of the ``IssueWidth`` in LLVM's scheduling model.
 939
 940 An instruction can be dispatched if:
 941
 942 * The size of the dispatch group is smaller than the processor's dispatch width.
 943 * There are enough entries in the reorder buffer.
 944 * There are enough physical registers to do register renaming.
 945 * The schedulers are not full.
 946
 947 Scheduling models can optionally specify which register files are available on
 948 the processor. :program:`llvm-mca` uses that information to initialize register
 949 file descriptors.  Users can limit the number of physical registers that are
 950 globally available for register renaming by using the command option
 951 ``-register-file-size``.  A value of zero for this option means *unbounded*. By
 952 knowing how many registers are available for renaming, the tool can predict
 953 dispatch stalls caused by the lack of physical registers.
 954
 955 The number of reorder buffer entries consumed by an instruction depends on the
 956 number of micro-opcodes specified for that instruction by the target scheduling
 957 model.  The reorder buffer is responsible for tracking the progress of
 958 instructions that are "in-flight", and retiring them in program order.  The
 959 number of entries in the reorder buffer defaults to the value specified by field
 960 ``MicroOpBufferSize`` in the target scheduling model.
 961
 962 Instructions that are dispatched to the schedulers consume scheduler buffer
 963 entries. :program:`llvm-mca` queries the scheduling model to determine the set
 964 of buffered resources consumed by an instruction.  Buffered resources are
 965 treated like scheduler resources.
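
The dispatch conditions described in this section can be sketched as a single
predicate.  This is a simplified Python illustration, not the actual
implementation: the ``DispatchState`` type and its field names are hypothetical,
and it assumes one physical register per register write and one scheduler buffer
entry per instruction.

.. code-block:: python

  from dataclasses import dataclass


  @dataclass
  class DispatchState:
      """Hypothetical snapshot of the simulated resources in one cycle."""
      dispatch_width: int          # processor dispatch width
      group_size: int              # size of the current dispatch group
      free_rob_entries: int        # unused reorder buffer entries
      free_physical_regs: int      # registers still available for renaming
      free_scheduler_entries: int  # unused scheduler buffer entries


  def can_dispatch(state, num_uops, num_reg_writes):
      """Return True if an instruction passes all four dispatch checks."""
      return (
          state.group_size < state.dispatch_width
          and state.free_rob_entries >= num_uops        # one ROB entry per uop
          and state.free_physical_regs >= num_reg_writes
          and state.free_scheduler_entries >= 1
      )


  state = DispatchState(dispatch_width=2, group_size=1, free_rob_entries=4,
                        free_physical_regs=10, free_scheduler_entries=0)
  print(can_dispatch(state, num_uops=1, num_reg_writes=1))  # False

In the example call, the instruction is rejected only because the scheduler
buffer is full, which corresponds to a ``SCHEDQ`` dispatch stall.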
 966
 967 Instruction Issue
 968 """""""""""""""""
 969 Each processor scheduler implements a buffer of instructions.  An instruction
 970 has to wait in the scheduler's buffer until input register operands become
 971 available.  Only at that point does the instruction become eligible for
 972 execution and may be issued (potentially out-of-order) for execution.
 973 Instruction latencies are computed by :program:`llvm-mca` with the help of the
 974 scheduling model.
 975
 976 :program:`llvm-mca`'s scheduler is designed to simulate multiple processor
 977 schedulers.  The scheduler is responsible for tracking data dependencies, and
 978 dynamically selecting which processor resources are consumed by instructions.
 979 It delegates the management of processor resource units and resource groups to a
 980 resource manager.  The resource manager is responsible for selecting resource
 981 units that are consumed by instructions.  For example, if an instruction
 982 consumes 1cy of a resource group, the resource manager selects one of the
 983 available units from the group; by default, the resource manager uses a
 984 round-robin selector to guarantee that resource usage is uniformly distributed
 985 between all units of a group.
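
A round-robin selector of this kind can be sketched as follows.  The Python
class below is a hypothetical illustration and does not mirror the resource
manager's real interface; the unit names in the example are placeholders.

.. code-block:: python

  class RoundRobinGroup:
      """Hypothetical resource group that hands out units in round-robin order."""

      def __init__(self, unit_names):
          self.unit_names = list(unit_names)
          self.busy = {name: False for name in self.unit_names}
          self.next_index = 0  # position where the next search starts

      def select(self):
          """Return the next available unit, or None if all units are busy."""
          count = len(self.unit_names)
          for offset in range(count):
              index = (self.next_index + offset) % count
              name = self.unit_names[index]
              if not self.busy[name]:
                  self.busy[name] = True
                  self.next_index = (index + 1) % count
                  return name
          return None

      def release(self, name):
          """Mark a unit as available again."""
          self.busy[name] = False


  group = RoundRobinGroup(["unit0", "unit1"])
  print(group.select())   # unit0
  group.release("unit0")
  print(group.select())   # unit1, even though unit0 is free again

Starting each search after the most recently selected unit is what spreads the
usage evenly across the group.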
 986
 987 :program:`llvm-mca`'s scheduler internally groups instructions into three sets:
 988
 989 * WaitSet: a set of instructions whose operands are not ready.
 990 * ReadySet: a set of instructions ready to execute.
 991 * IssuedSet: a set of instructions executing.
 992
 993 Depending on operand availability, instructions that are dispatched to the
 994 scheduler are either placed into the WaitSet or into the ReadySet.
 995
 996 Every cycle, the scheduler checks if instructions can be moved from the WaitSet
 997 to the ReadySet, and if instructions from the ReadySet can be issued to the
 998 underlying pipelines. The algorithm prioritizes older instructions over younger
 999 instructions.
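
The per-cycle behaviour described above can be sketched with three plain lists.
This is a toy Python model under simplifying assumptions (the function name, the
dictionary layout, and the single ``free_units`` counter are all hypothetical),
not the scheduler's actual code.

.. code-block:: python

  def scheduler_cycle(wait_set, ready_set, issued_set, ready_regs, free_units):
      """Advance a toy scheduler by one cycle.

      Each instruction is a dict with an "index" (program order) and a set of
      input registers under "reads".  ``ready_regs`` is the set of registers
      whose values are available; ``free_units`` is how many instructions the
      pipelines can accept this cycle.
      """
      # Promote instructions whose operands have all become available.
      for inst in sorted(wait_set, key=lambda i: i["index"]):
          if inst["reads"] <= ready_regs:
              wait_set.remove(inst)
              ready_set.append(inst)

      # Issue from the ReadySet, oldest instruction first.
      for inst in sorted(ready_set, key=lambda i: i["index"]):
          if free_units == 0:
              break
          ready_set.remove(inst)
          issued_set.append(inst)
          free_units -= 1


  wait = [{"index": 2, "reads": {"xmm3"}}, {"index": 1, "reads": {"xmm2"}}]
  ready = [{"index": 0, "reads": set()}]
  issued = []
  scheduler_cycle(wait, ready, issued, ready_regs={"xmm2"}, free_units=1)
  print([i["index"] for i in issued])  # [0]
  print([i["index"] for i in ready])   # [1]
  print([i["index"] for i in wait])    # [2]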
1000
1001 Write-Back and Retire Stage
1002 """""""""""""""""""""""""""
1003 Issued instructions are moved from the ReadySet to the IssuedSet.  There,
1004 instructions wait until they reach the write-back stage.  At that point, they
1005 get removed from the queue and the retire control unit is notified.
1006
1007 When an instruction is executed, the retire control unit flags it as
1008 "ready to retire."
1009
1010 Instructions are retired in program order.  The register file is notified of the
1011 retirement so that it can free the physical registers that were allocated for
1012 the instruction during the register renaming stage.
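
In-order retirement can be sketched as draining the head of a queue.  The Python
function below is a hypothetical illustration: the per-cycle limit
``max_retire`` is an assumed parameter, and the notification to the register
file is omitted.

.. code-block:: python

  from collections import deque


  def retire_cycle(rob, executed, max_retire=2):
      """Retire executed instructions from the head of the reorder buffer.

      ``rob`` is a deque of instruction indices in program order and
      ``executed`` is the set of indices flagged as "ready to retire".
      """
      retired = []
      while rob and len(retired) < max_retire and rob[0] in executed:
          retired.append(rob.popleft())
      return retired


  rob = deque([0, 1, 2, 3])
  executed = {1, 2}
  print(retire_cycle(rob, executed))  # []: instruction 0 blocks younger ones
  executed.add(0)
  print(retire_cycle(rob, executed))  # [0, 1]
  print(retire_cycle(rob, executed))  # [2]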
1013
1014 Load/Store Unit and Memory Consistency Model
1015 """"""""""""""""""""""""""""""""""""""""""""
1016 To model the out-of-order execution of memory operations, :program:`llvm-mca`
1017 uses a simulated load/store unit (LSUnit), which models the speculative
1018 execution of loads and stores.
1019
1020 Each load (or store) consumes an entry in the load (or store) queue. Users can
1021 specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
1022 load and store queues respectively. The queues are unbounded by default.
1023
1024 The LSUnit implements a relaxed consistency model for memory loads and stores.
1025 The rules are:
1026
1027 1. A younger load is allowed to pass an older load only if there are no
1028    intervening stores or barriers between the two loads.
1029 2. A younger load is allowed to pass an older store provided that the load does
1030    not alias with the store.
1031 3. A younger store is not allowed to pass an older store.
1032 4. A younger store is not allowed to pass an older load.
1033
1034 By default, the LSUnit optimistically assumes that loads do not alias with
1035 store operations (``-noalias=true``).  Under this assumption, younger loads are
1036 always allowed to pass older stores.  Essentially, the LSUnit does not attempt
1037 to run any alias analysis to predict when loads and stores do not alias with
1038 each other.
1039
1040 Note that, in the case of write-combining memory, rule 3 could be relaxed to
1041 allow reordering of non-aliasing store operations.  That being said, at the
1042 moment, there is no way to further relax the memory model (``-noalias`` is the
1043 only option).  Essentially, there is no option to specify a different memory
1044 type (e.g., write-back, write-combining, write-through, etc.) and consequently
1045 to weaken, or strengthen, the memory model.
1046
1047 Other limitations are:
1048
1049 * The LSUnit does not know when store-to-load forwarding may occur.
1050 * The LSUnit does not know anything about cache hierarchy and memory types.
1051 * The LSUnit does not know how to identify serializing operations and memory
1052   fences.
1053
1054 The LSUnit does not attempt to predict if a load or store hits or misses the L1
1055 cache.  It only knows if an instruction "MayLoad" and/or "MayStore."  For
1056 loads, the scheduling model provides an "optimistic" load-to-use latency (which
1057 usually matches the load-to-use latency for when there is a hit in the L1D).
1058
1059 :program:`llvm-mca` does not (on its own) know about serializing operations or
1060 memory-barrier like instructions.  The LSUnit used to conservatively use an
1061 instruction's "MayLoad", "MayStore", and unmodeled side effects flags to
1062 determine whether an instruction should be treated as a memory-barrier. This was
1063 inaccurate in general and was changed so that now each instruction has an
1064 IsAStoreBarrier and IsALoadBarrier flag. These flags are mca specific and
1065 default to false for every instruction. If any instruction should have either of
1066 these flags set, it should be done within the target's InstrPostProcess class.
1067 For an example, look at the `X86InstrPostProcess::postProcessInstruction` method
1068 within `llvm/lib/Target/X86/MCA/X86CustomBehaviour.cpp`.
1069
1070 A load/store barrier consumes one entry of the load/store queue.  A load/store
1071 barrier enforces ordering of loads/stores.  A younger load cannot pass a load
1072 barrier.  Also, a younger store cannot pass a store barrier.  A younger load
1073 has to wait for the memory/load barrier to execute.  A load/store barrier is
1074 "executed" when it becomes the oldest entry in the load/store queue(s). That
1075 also means, by construction, all of the older loads/stores have been executed.
1076
1077 In conclusion, the full set of load/store consistency rules is:
1078
1079 #. A store may not pass a previous store.
1080 #. A store may not pass a previous load (regardless of ``-noalias``).
1081 #. A store has to wait until an older store barrier is fully executed.
1082 #. A load may pass a previous load.
1083 #. A load may not pass a previous store unless ``-noalias`` is set.
1084 #. A load has to wait until an older load barrier is fully executed.
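
The six rules above can be condensed into a pairwise check.  The Python function
below is only a sketch: its name and parameters are hypothetical, and it
considers a single pair of memory operations rather than the full contents of
the load and store queues.

.. code-block:: python

  def may_pass(younger, older, noalias=True, barrier_between=False):
      """Return True if ``younger`` may execute before ``older``.

      Both arguments are "load" or "store".  ``barrier_between`` stands for an
      older, not yet executed load barrier (for loads) or store barrier (for
      stores).
      """
      if barrier_between:
          return False   # rules 3 and 6: wait for the barrier
      if younger == "store":
          return False   # rules 1 and 2: stores never pass older memory ops
      if older == "load":
          return True    # rule 4: a load may pass a previous load
      return noalias     # rule 5: a load passes a store only under -noalias


  print(may_pass("load", "store"))                 # True
  print(may_pass("load", "store", noalias=False))  # False
  print(may_pass("store", "load"))                 # False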
1085
1086 In-order Issue and Execute
1087 """"""""""""""""""""""""""""""""""""
1088 In-order processors are modelled as a single ``InOrderIssueStage`` stage. It
1089 bypasses Dispatch, Scheduler and Load/Store unit. Instructions are issued as
1090 soon as their operand registers are available and resource requirements are
1091 met. Multiple instructions can be issued in one cycle according to the value of
1092 the ``IssueWidth`` parameter in LLVM's scheduling model.
1093
1094 Once issued, an instruction is moved to the ``IssuedInst`` set until it is ready
1095 to retire. :program:`llvm-mca` ensures that writes are committed in order.
1096 However, an instruction is allowed to commit writes and retire out of order if
1097 the ``RetireOOO`` property is true for at least one of its writes.
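
The issue rule for in-order processors can be sketched as follows.  This Python
function is a hypothetical simplification: it checks only operand availability
and the issue width, and ignores resource requirements and the ``RetireOOO``
property.

.. code-block:: python

  def issue_in_order(pending, ready_regs, issue_width):
      """Issue up to ``issue_width`` instructions from the front of ``pending``.

      ``pending`` is a list of (name, input_registers) tuples in program order.
      Issue stops at the first instruction whose operands are not available,
      because younger instructions may not overtake it.
      """
      issued = []
      while pending and len(issued) < issue_width:
          name, reads = pending[0]
          if not set(reads) <= ready_regs:
              break
          issued.append(name)
          pending.pop(0)
      return issued


  pending = [("add", ["x1"]), ("mul", ["x9"]), ("sub", ["x2"])]
  print(issue_in_order(pending, ready_regs={"x1", "x2"}, issue_width=2))
  # ['add']: "mul" is not ready, and "sub" cannot be issued ahead of it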
1098
1099 Custom Behaviour
1100 """"""""""""""""""""""""""""""""""""
1101 Due to certain instructions not being expressed perfectly within their
1102 scheduling model, :program:`llvm-mca` isn't always able to simulate them
1103 perfectly. Modifying the scheduling model isn't always a viable
1104 option though (maybe because the instruction is modeled incorrectly on
1105 purpose or the instruction's behaviour is quite complex). The
1106 CustomBehaviour class can be used in these cases to enforce proper
1107 instruction modeling (often by customizing data dependencies and detecting
1108 hazards that :program:`llvm-mca` has no way of knowing about).
1109
1110 :program:`llvm-mca` comes with one generic and multiple target specific
1111 CustomBehaviour classes. The generic class will be used if the ``-disable-cb``
1112 flag is used or if a target specific CustomBehaviour class doesn't exist for
1113 that target. (The generic class does nothing.) Currently, the CustomBehaviour
1114 class is only a part of the in-order pipeline, but there are plans to add it
1115 to the out-of-order pipeline in the future.
1116
1117 CustomBehaviour's main method is `checkCustomHazard()` which uses the
1118 current instruction and a list of all instructions still executing within
1119 the pipeline to determine if the current instruction should be dispatched.
1120 As output, the method returns an integer representing the number of cycles
1121 that the current instruction must stall for (this can be an underestimate
1122 if you don't know the exact number and a value of 0 represents no stall).
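
The contract of such a hazard check (take the current instruction and the
instructions still executing, return a stall in cycles) can be illustrated in
Python.  The real method is part of a C++ class and its signature is not
reproduced here; the dictionary keys and the ``mem_counter`` tag below are
hypothetical.

.. code-block:: python

  def check_custom_hazard(in_flight, current):
      """Toy analogue of a hazard check: return the number of stall cycles.

      Each in-flight instruction holds a set of hazard tags and has
      ``cycles_left`` until it completes.  The current instruction must wait
      for every in-flight instruction holding a tag it waits on.
      A return value of 0 means "no stall".
      """
      stall = 0
      for inst in in_flight:
          if inst["holds"] & current["waits_on"]:
              stall = max(stall, inst["cycles_left"])
      return stall


  in_flight = [
      {"holds": {"mem_counter"}, "cycles_left": 3},
      {"holds": set(), "cycles_left": 7},
  ]
  print(check_custom_hazard(in_flight, {"waits_on": {"mem_counter"}}))  # 3
  print(check_custom_hazard(in_flight, {"waits_on": set()}))            # 0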
1123
1124 If you'd like to add a CustomBehaviour class for a target that doesn't
1125 already have one, refer to an existing implementation to see how to set it
1126 up. The classes are implemented within the target specific backend (for
1127 example `/llvm/lib/Target/AMDGPU/MCA/`) so that they can access backend symbols.
1128
1129 Instrument Manager
1130 """"""""""""""""""""""""""""""""""""
1131 On certain architectures, scheduling information for certain instructions
1132 does not contain all of the information required to identify the most precise
1133 schedule class. For example, data that can have an impact on scheduling can
1134 be stored in CSR registers.
1135
1136 One example of this is on RISCV, where values in registers such as `vtype`
1137 and `vl` change the scheduling behaviour of vector instructions. Since MCA
1138 does not keep track of the values in registers, instrument comments can
1139 be used to specify these values.
1140
1141 InstrumentManager's main function is `getSchedClassID()` which has access
1142 to the MCInst and all of the instruments that are active for that MCInst.
1143 This function can use the instruments to override the schedule class of
1144 the MCInst.
1145
1146 On RISCV, instrument comments containing LMUL information are used
1147 by `getSchedClassID()` to map a vector instruction and the active
1148 LMUL to the scheduling class of the pseudo-instruction that describes
1149 that base instruction and the active LMUL.
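
Conceptually, this mapping is a table lookup keyed by the instruction and the
active LMUL.  The Python sketch below only illustrates the idea: the table
contents, the function name, and the shape of the ``instruments`` argument are
hypothetical and do not reflect the C++ interface.

.. code-block:: python

  # Hypothetical table: (base opcode, LMUL) -> pseudo-instruction whose
  # scheduling class should be used.  The names are for illustration only.
  PSEUDO_FOR_LMUL = {
      ("VADD_VV", "M1"): "PseudoVADD_VV_M1",
      ("VADD_VV", "M2"): "PseudoVADD_VV_M2",
  }


  def get_sched_class(opcode, instruments, default_class):
      """Pick a scheduling class, letting an active LMUL instrument override it."""
      lmul = instruments.get("LMUL")
      if lmul is None:
          return default_class
      return PSEUDO_FOR_LMUL.get((opcode, lmul), default_class)


  print(get_sched_class("VADD_VV", {"LMUL": "M2"}, "VADD_VV"))  # PseudoVADD_VV_M2
  print(get_sched_class("VADD_VV", {}, "VADD_VV"))              # VADD_VV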
1150
1151 Custom Views
1152 """"""""""""""""""""""""""""""""""""
1153 :program:`llvm-mca` comes with several Views such as the Timeline View and
1154 Summary View. These Views are generic and can work with most (if not all)
1155 targets. If you wish to add a new View to :program:`llvm-mca` and it does not
1156 require any backend functionality that is not already exposed through MC layer
1157 classes (MCSubtargetInfo, MCInstrInfo, etc.), please add it to the
1158 `/tools/llvm-mca/View/` directory. However, if your new View is target specific
1159 AND requires unexposed backend symbols or functionality, you can define it in
1160 the `/lib/Target/<TargetName>/MCA/` directory.
1161
1162 To enable this target specific View, you will have to use this target's
1163 CustomBehaviour class to override the `CustomBehaviour::getViews()` methods.
1164 There are 3 variations of these methods based on where you want your View to
1165 appear in the output: `getStartViews()`, `getPostInstrInfoViews()`, and
1166 `getEndViews()`. These methods return a vector of Views so you will want to
1167 return a vector containing all of the target specific Views for the target in
1168 question.
1169
1170 Because these target specific (and backend dependent) Views require the
1171 `CustomBehaviour::getViews()` variants, these Views will not be enabled if
1172 the ``-disable-cb`` flag is used.
1173
1174 Enabling these custom Views does not affect the non-custom (generic) Views.
1175 Continue to use the usual command line arguments to enable / disable those
1176 Views.