1 llvm-exegesis - LLVM Machine Instruction Benchmark
2 ==================================================
4 .. program:: llvm-exegesis
9 :program:`llvm-exegesis` [*options*]
14 :program:`llvm-exegesis` is a benchmarking tool that uses information available
15 in LLVM to measure host machine instruction characteristics like latency,
16 throughput, or port decomposition.
Given an LLVM opcode name and a benchmarking mode, :program:`llvm-exegesis`
generates a code snippet that makes execution as serial (resp. as parallel) as
possible so that we can measure the latency (resp. inverse throughput/uop
decomposition) of the instruction.
22 The code snippet is jitted and, unless requested not to, executed on the
23 host subtarget. The time taken (resp. resource usage) is measured using
24 hardware performance counters. The result is printed out as YAML
25 to the standard output.
The main goal of this tool is to automatically (in)validate LLVM's TableGen
scheduling models. To that end, we also provide analysis of the results.

:program:`llvm-exegesis` can also benchmark arbitrary user-provided code
snippets.
36 :program:`llvm-exegesis` currently only supports X86 (64-bit only), ARM (AArch64
37 only), MIPS, and PowerPC (PowerPC64LE only) on Linux for benchmarking. Not all
38 benchmarking functionality is guaranteed to work on every platform.
:program:`llvm-exegesis` also has a separate analysis mode that is supported
on every platform that LLVM supports.
45 :program:`llvm-exegesis` supports benchmarking arbitrary snippets of assembly.
46 However, benchmarking these snippets often requires some setup so that they
can execute properly. :program:`llvm-exegesis` has five annotations and some
additional utilities to help with setup so that snippets can be benchmarked
properly.

* `LLVM-EXEGESIS-DEFREG <register name>` - Adding this annotation to the text
  assembly snippet to be benchmarked marks the register as requiring a definition.
  A value will automatically be provided unless a second parameter, a hex value,
  is passed in. This is done with the `LLVM-EXEGESIS-DEFREG <register name> <hex value>`
  format. `<hex value>` is a bit pattern used to fill the register. If it is a
  value smaller than the register, it is sign extended to match the size of the
  register.

* `LLVM-EXEGESIS-LIVEIN <register name>` - This annotation allows specifying
  registers that should keep their value upon starting the benchmark. Values
  can be passed through registers from the benchmarking setup in some cases.
  The registers and the values assigned to them that can be utilized in the
  benchmarking script with a `LLVM-EXEGESIS-LIVEIN` are as follows:

  * Scratch memory register - The specific register that this value is put in
    is platform dependent (e.g., it is the RDI register on X86 Linux). Setting
    this register as a live in ensures that a pointer to a block of memory (1MB)
    is placed within this register that can be used by the snippet.

* `LLVM-EXEGESIS-MEM-DEF <value name> <size> <value>` - This annotation allows
  specifying memory definitions that can later be mapped into the execution
  process of a snippet with the `LLVM-EXEGESIS-MEM-MAP` annotation. Each
  value is named using the `<value name>` argument so that it can be referenced
  later within a map annotation. The size is specified in a decimal number of
  bytes and the value is given in hexadecimal. If the size of the value is less
  than the specified size, the value will be repeated until it fills the entire
  section of memory. Using this annotation requires the subprocess execution
  mode.

* `LLVM-EXEGESIS-MEM-MAP <value name> <address>` - This annotation allows for
  mapping previously defined memory definitions into the execution context of a
  process. The value name refers to a previously defined memory definition and
  the address is a decimal number that specifies the address the memory
  definition should start at. Note that a single memory definition can be
  mapped multiple times. Using this annotation requires the subprocess
  execution mode.

* `LLVM-EXEGESIS-SNIPPET-ADDRESS <address>` - This annotation allows for
  setting the address at which the beginning of the snippet to be executed
  will be mapped. The address is given in hexadecimal. Note that the snippet
  also includes setup code, so the instruction exactly at the specified
  address will not be the first instruction in the snippet. Using this
  annotation requires the subprocess execution mode. This is useful in
  cases where the memory accessed by the snippet depends on the location
  of the snippet, like RIP-relative addressing.
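
As a rough illustration of the number formats used by the annotations above,
the following Python sketch renders annotation lines for a snippet file. The
helper names (`defreg`, `mem_def`, `mem_map`, `snippet_address`) are
hypothetical and are not part of :program:`llvm-exegesis`. Writing hexadecimal
values without a `0x` prefix mirrors the examples in this document and is an
assumption about the accepted syntax.

```python
# Hypothetical helpers that render llvm-exegesis snippet annotations.
# Number formats follow the descriptions above: MEM-DEF sizes and MEM-MAP
# addresses are decimal; DEFREG values, MEM-DEF values, and SNIPPET-ADDRESS
# are hexadecimal. Omitting the "0x" prefix is an assumption based on the
# examples in this document.

PREFIX = "# LLVM-EXEGESIS-"


def defreg(register, value=None):
    """Mark a register as needing a definition, optionally with a hex value."""
    if value is None:
        return f"{PREFIX}DEFREG {register}"
    return f"{PREFIX}DEFREG {register} {value:x}"


def mem_def(name, size_bytes, value):
    """Define a named memory value: decimal size in bytes, hex fill value."""
    return f"{PREFIX}MEM-DEF {name} {size_bytes} {value:x}"


def mem_map(name, address):
    """Map a previously defined memory value at a decimal address."""
    return f"{PREFIX}MEM-MAP {name} {address}"


def snippet_address(address):
    """Request a load address for the snippet, written in hex."""
    return f"{PREFIX}SNIPPET-ADDRESS {address:x}"


if __name__ == "__main__":
    print(defreg("XMM1", 0x42))
    print(mem_def("test1", 4096, 0x7FFFFFFF))
    print(mem_map("test1", 0x2000))  # the address is rendered in decimal
    print(snippet_address(0x10000))
```

Generating the lines this way keeps the decimal/hexadecimal distinction in one
place, which is easy to get wrong when writing annotations by hand.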
93 EXAMPLE 1: benchmarking instructions
94 ------------------------------------
Assume you have an X86-64 machine. To measure the latency of a single
instruction, run:
101 $ llvm-exegesis --mode=latency --opcode-name=ADD64rr
103 Measuring the uop decomposition or inverse throughput of an instruction works similarly:
107 $ llvm-exegesis --mode=uops --opcode-name=ADD64rr
108 $ llvm-exegesis --mode=inverse_throughput --opcode-name=ADD64rr
111 The output is a YAML document (the default is to write to stdout, but you can
112 redirect the output to a file using `--benchmarks-file`):
122 llvm_triple: x86_64-unknown-linux-gnu
123 num_repetitions: 10000
125 - { key: latency, value: 1.0058, debug_string: '' }
127 info: 'explicit self cycles, selecting one aliasing configuration.
133 To measure the latency of all instructions for the host architecture, run:
137 $ llvm-exegesis --mode=latency --opcode-index=-1
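
When driving many runs from a script, command lines like the ones above can be
assembled programmatically. The Python sketch below only builds the argument
list from flags documented on this page. The function name `exegesis_command`
is hypothetical, the check that exactly one opcode selector is given is a
choice made for this sketch, and running the result assumes `llvm-exegesis` is
on `PATH`.

```python
import shlex

# Benchmarking modes documented for --mode (analysis is handled separately).
MODES = ("latency", "uops", "inverse_throughput")


def exegesis_command(mode, opcode_names=None, opcode_index=None,
                     benchmarks_file=None):
    """Build an llvm-exegesis argument list for one benchmarking run."""
    if mode not in MODES:
        raise ValueError(f"unsupported mode: {mode}")
    if (opcode_names is None) == (opcode_index is None):
        raise ValueError("give exactly one of opcode_names or opcode_index")

    cmd = ["llvm-exegesis", f"--mode={mode}"]
    if opcode_names is not None:
        # Several opcodes are passed as a comma-separated list.
        cmd.append("--opcode-name=" + ",".join(opcode_names))
    else:
        cmd.append(f"--opcode-index={opcode_index}")
    if benchmarks_file is not None:
        cmd.append(f"--benchmarks-file={benchmarks_file}")
    return cmd


if __name__ == "__main__":
    print(shlex.join(exegesis_command("latency", opcode_names=["ADD64rr"])))
    print(shlex.join(exegesis_command("latency", opcode_index=-1,
                                      benchmarks_file="/tmp/benchmarks.yaml")))
```

The returned list can be passed directly to `subprocess.run` without going
through a shell.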
140 EXAMPLE 2: benchmarking a custom code snippet
141 ---------------------------------------------
143 To measure the latency/uops of a custom piece of code, you can specify the
144 `snippets-file` option (`-` reads from standard input).
148 $ echo "vzeroupper" | llvm-exegesis --mode=uops --snippets-file=-
150 Real-life code snippets typically depend on registers or memory.
:program:`llvm-exegesis` checks the liveness of registers (i.e. any register
152 use has a corresponding def or is a "live in"). If your code depends on the
153 value of some registers, you need to use snippet annotations to ensure setup
154 is performed properly.
156 For example, the following code snippet depends on the values of XMM1 (which
157 will be set by the tool) and the memory buffer passed in RDI (live in).
161 # LLVM-EXEGESIS-LIVEIN RDI
162 # LLVM-EXEGESIS-DEFREG XMM1 42
163 vmulps (%rdi), %xmm1, %xmm2
164 vhaddps %xmm2, %xmm2, %xmm3
EXAMPLE 3: benchmarking with memory annotations
-----------------------------------------------
171 Some snippets require memory setup in specific places to execute without
172 crashing. Setting up memory can be accomplished with the `LLVM-EXEGESIS-MEM-DEF`
173 and `LLVM-EXEGESIS-MEM-MAP` annotations. To execute the following snippet:
We need to have at least eight bytes of memory allocated starting at `0x2000`.
181 We can create the necessary execution environment with the following
182 annotations added to the snippet:
186 # LLVM-EXEGESIS-MEM-DEF test1 4096 2147483647
187 # LLVM-EXEGESIS-MEM-MAP test1 8192
195 Assuming you have a set of benchmarked instructions (either latency or uops) as
YAML in file `/tmp/benchmarks.yaml`, you can analyze the results using the
following command:
201 $ llvm-exegesis --mode=analysis \
202 --benchmarks-file=/tmp/benchmarks.yaml \
203 --analysis-clusters-output-file=/tmp/clusters.csv \
204 --analysis-inconsistencies-output-file=/tmp/inconsistencies.html
206 This will group the instructions into clusters with the same performance
characteristics. The clusters will be written out to `/tmp/clusters.csv` in the
following format:
212 cluster_id,opcode_name,config,sched_class
214 2,ADD32ri8_DB,,WriteALU,1.00
215 2,ADD32ri_DB,,WriteALU,1.01
216 2,ADD32rr,,WriteALU,1.01
217 2,ADD32rr_DB,,WriteALU,1.00
218 2,ADD32rr_REV,,WriteALU,1.00
219 2,ADD64i32,,WriteALU,1.01
220 2,ADD64ri32,,WriteALU,1.01
221 2,MOVSX64rr32,,BSWAP32r_BSWAP64r_MOVSX64rr32,1.00
222 2,VPADDQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.02
223 2,VPSUBQYrr,,VPADDBYrr_VPADDDYrr_VPADDQYrr_VPADDWYrr_VPSUBBYrr_VPSUBDYrr_VPSUBQYrr_VPSUBWYrr,1.01
224 2,ADD64ri8,,WriteALU,1.00
225 2,SETBr,,WriteSETCC,1.01
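
The clusters CSV can be post-processed with ordinary tooling. The sketch below
groups opcodes by scheduling class using Python's `csv` module. It assumes that
each data row carries the measured value in a trailing fifth column after the
four named columns, as in the sample rows above. The function name
`opcodes_by_sched_class` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def opcodes_by_sched_class(csv_text):
    """Return {sched_class: [(opcode_name, measurement), ...]}."""
    groups = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines, the four-column header, and malformed rows.
        if len(row) < 5 or row[0] == "cluster_id":
            continue
        _cluster_id, opcode, _config, sched_class, value = row[:5]
        groups[sched_class].append((opcode, float(value)))
    return dict(groups)


# A few rows copied from the sample output above.
SAMPLE = """\
cluster_id,opcode_name,config,sched_class
2,ADD32ri8_DB,,WriteALU,1.00
2,ADD32rr,,WriteALU,1.01
2,SETBr,,WriteSETCC,1.01
"""

if __name__ == "__main__":
    for sched_class, entries in opcodes_by_sched_class(SAMPLE).items():
        print(sched_class, entries)
```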
228 :program:`llvm-exegesis` will also analyze the clusters to point out
inconsistencies in the scheduling information. The output is an HTML file. For
example, `/tmp/inconsistencies.html` will contain messages like the following:
232 .. image:: llvm-exegesis-analysis.png
235 Note that the scheduling class names will be resolved only when
236 :program:`llvm-exegesis` is compiled in debug mode, else only the class id will
237 be shown. This does not invalidate any of the analysis results though.
244 Print a summary of command line options.
246 .. option:: --opcode-index=<LLVM opcode index>
248 Specify the opcode to measure, by index. Specifying `-1` will result
249 in measuring every existing opcode. See example 1 for details.
250 Either `opcode-index`, `opcode-name` or `snippets-file` must be set.
252 .. option:: --opcode-name=<opcode name 1>,<opcode name 2>,...
254 Specify the opcode to measure, by name. Several opcodes can be specified as
255 a comma-separated list. See example 1 for details.
256 Either `opcode-index`, `opcode-name` or `snippets-file` must be set.
258 .. option:: --snippets-file=<filename>
260 Specify the custom code snippet to measure. See example 2 for details.
261 Either `opcode-index`, `opcode-name` or `snippets-file` must be set.
263 .. option:: --mode=[latency|uops|inverse_throughput|analysis]
265 Specify the run mode. Note that some modes have additional requirements and options.
`latency` mode can make use of either RDTSC or LBR.
`latency[LBR]` is only available on X86 (at least `Skylake`).
To run in `latency[LBR]` mode, a positive value must be specified
for `x86-lbr-sample-period` and `--repetition-mode=loop` must be used.
272 In `analysis` mode, you also need to specify at least one of the
273 `-analysis-clusters-output-file=` and `-analysis-inconsistencies-output-file=`.
275 .. option:: --benchmark-phase=[prepare-snippet|prepare-and-assemble-snippet|assemble-measured-code|measure]
277 By default, when `-mode=` is specified, the generated snippet will be executed
278 and measured, and that requires that we are running on the hardware for which
279 the snippet was generated, and that supports performance measurements.
280 However, it is possible to stop at some stage before measuring. Choices are:
281 * ``prepare-snippet``: Only generate the minimal instruction sequence.
282 * ``prepare-and-assemble-snippet``: Same as ``prepare-snippet``, but also dumps an excerpt of the sequence (hex encoded).
* ``assemble-measured-code``: Same as ``prepare-and-assemble-snippet``, but also creates the full sequence that can be dumped to a file using ``--dump-object-to-disk``.
284 * ``measure``: Same as ``assemble-measured-code``, but also runs the measurement.
286 .. option:: --x86-lbr-sample-period=<nBranches/sample>
288 Specify the LBR sampling period - how many branches before we take a sample.
289 When a positive value is specified for this option and when the mode is `latency`,
290 we will use LBRs for measuring.
When choosing the sampling period, a small value is preferred, but throttling
could occur if the sampling is too frequent. A prime number should be used to
avoid consistently skipping certain blocks.
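
One way to follow the prime-number advice is to round a desired period up to
the next prime. The Python sketch below does that and assembles the flags that
LBR-based latency measurement requires. The helper names are hypothetical and
the starting value of 500 is arbitrary, not a recommendation.

```python
def is_prime(n):
    """Trial-division primality test; adequate for small sampling periods."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime(n):
    """Return the smallest prime that is greater than or equal to n."""
    candidate = max(n, 2)
    while not is_prime(candidate):
        candidate += 1
    return candidate


def lbr_latency_flags(min_period):
    """Flags for LBR latency measurement with a prime sampling period."""
    return [
        "--mode=latency",
        "--repetition-mode=loop",
        f"--x86-lbr-sample-period={next_prime(min_period)}",
    ]


if __name__ == "__main__":
    print(" ".join(lbr_latency_flags(500)))
```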
295 .. option:: --x86-disable-upper-sse-registers
297 Using the upper xmm registers (xmm8-xmm15) forces a longer instruction encoding
298 which may put greater pressure on the frontend fetch and decode stages,
299 potentially reducing the rate that instructions are dispatched to the backend,
300 particularly on older hardware. Comparing baseline results with this mode
301 enabled can help determine the effects of the frontend and can be used to
302 improve latency and throughput estimates.
304 .. option:: --repetition-mode=[duplicate|loop|min]
306 Specify the repetition mode. `duplicate` will create a large, straight line
307 basic block with `num-repetitions` instructions (repeating the snippet
308 `num-repetitions`/`snippet size` times). `loop` will, optionally, duplicate the
309 snippet until the loop body contains at least `loop-body-size` instructions,
310 and then wrap the result in a loop which will execute `num-repetitions`
311 instructions (thus, again, repeating the snippet
`num-repetitions`/`snippet size` times). The `loop` mode, especially with loop
unrolling, tends to better hide the effects of the CPU frontend on architectures
that cache decoded instructions, but consumes a register for counting
iterations. If performing an analysis over many opcodes, it may be best to
instead use the `min` mode, which runs each of the other modes and reports the
minimum measured result.
319 .. option:: --num-repetitions=<Number of repetitions>
321 Specify the target number of executed instructions. Note that the actual
322 repetition count of the snippet will be `num-repetitions`/`snippet size`.
323 Higher values lead to more accurate measurements but lengthen the benchmark.
325 .. option:: --loop-body-size=<Preferred loop body size>
327 Only effective for `-repetition-mode=[loop|min]`.
328 Instead of looping over the snippet directly, first duplicate it so that the
loop body contains at least this many instructions. This potentially results
in the loop body being cached in the CPU Op Cache / Loop Cache, which may have
higher throughput than the CPU decoders.
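
The interaction between `num-repetitions`, `loop-body-size`, and the snippet
size can be sketched with a little arithmetic. The Python function below is
illustrative only: the name `loop_plan` is hypothetical, and rounding the
iteration count up is an assumption, since this page only gives the ratio
`num-repetitions`/`snippet size`.

```python
import math


def loop_plan(snippet_size, num_repetitions, loop_body_size=0):
    """Estimate how a snippet is laid out in `loop` repetition mode.

    Returns (copies_in_body, body_instructions, loop_iterations).
    """
    if snippet_size <= 0:
        raise ValueError("snippet_size must be positive")
    # Duplicate the snippet until the body has at least loop_body_size
    # instructions; a single copy is always present.
    copies = max(1, math.ceil(loop_body_size / snippet_size))
    body = copies * snippet_size
    # Assumption: iterations are rounded up so that at least
    # num_repetitions instructions are executed.
    iterations = math.ceil(num_repetitions / body)
    return copies, body, iterations


if __name__ == "__main__":
    # A 2-instruction snippet, 10000 target instructions, body of >= 7.
    print(loop_plan(2, 10000, 7))
```

For a two-instruction snippet with a preferred loop body size of 7, the body
holds four copies (eight instructions), so roughly 1250 iterations are needed
to reach 10000 executed instructions.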
333 .. option:: --max-configs-per-opcode=<value>
335 Specify the maximum configurations that can be generated for each opcode.
336 By default this is `1`, meaning that we assume that a single measurement is
337 enough to characterize an opcode. This might not be true of all instructions:
338 for example, the performance characteristics of the LEA instruction on X86
depend on the values of assigned registers and immediates. Setting a value of
340 `-max-configs-per-opcode` larger than `1` allows `llvm-exegesis` to explore
341 more configurations to discover if some register or immediate assignments
342 lead to different performance characteristics.
345 .. option:: --benchmarks-file=</path/to/file>
347 File to read (`analysis` mode) or write (`latency`/`uops`/`inverse_throughput`
348 modes) benchmark results. "-" uses stdin/stdout.
350 .. option:: --analysis-clusters-output-file=</path/to/file>
352 If provided, write the analysis clusters as CSV to this file. "-" prints to
353 stdout. By default, this analysis is not run.
355 .. option:: --analysis-inconsistencies-output-file=</path/to/file>
357 If non-empty, write inconsistencies found during analysis to this file. `-`
358 prints to stdout. By default, this analysis is not run.
360 .. option:: --analysis-filter=[all|reg-only|mem-only]
By default, all benchmark results are analysed, but sometimes it may be useful
to only look at those that do not involve memory, or vice versa. This option
allows you to either keep all benchmarks (`all`), keep only those that do not
involve memory (`reg-only`), or keep only those that involve instructions that
may read or write memory (`mem-only`).
368 .. option:: --analysis-clustering=[dbscan,naive]
Specify the clustering algorithm to use. By default DBSCAN will be used.
The naive clustering algorithm is better for doing further work on the
`-analysis-inconsistencies-output-file=` output: it creates one cluster
per opcode and checks that the cluster is stable (all points are neighbours).
375 .. option:: --analysis-numpoints=<dbscan numPoints parameter>
Specify the numPoints parameter to be used for DBSCAN clustering
(`analysis` mode, DBSCAN only).
380 .. option:: --analysis-clustering-epsilon=<dbscan epsilon parameter>
Specify the epsilon parameter used for clustering of benchmark points
(`analysis` mode).
385 .. option:: --analysis-inconsistency-epsilon=<epsilon>
387 Specify the epsilon parameter used for detection of when the cluster
388 is different from the LLVM schedule profile values (`analysis` mode).
390 .. option:: --analysis-display-unstable-clusters
392 If there is more than one benchmark for an opcode, said benchmarks may end up
393 not being clustered into the same cluster if the measured performance
characteristics are different. By default all such opcodes are filtered out.
395 This flag will instead show only such unstable opcodes.
397 .. option:: --ignore-invalid-sched-class=false
399 If set, ignore instructions that do not have a sched class (class idx = 0).
401 .. option:: --mtriple=<triple name>
403 Target triple. See `-version` for available targets.
405 .. option:: --mcpu=<cpu name>
407 If set, measure the cpu characteristics using the counters for this CPU. This
408 is useful when creating new sched models (the host CPU is unknown to LLVM).
409 (`-mcpu=help` for details)
411 .. option:: --analysis-override-benchmark-triple-and-cpu
413 By default, llvm-exegesis will analyze the benchmarks for the triple/CPU they
414 were measured for, but if you want to analyze them for some other combination
415 (specified via `-mtriple`/`-mcpu`), you can pass this flag.
417 .. option:: --dump-object-to-disk=true
419 If set, llvm-exegesis will dump the generated code to a temporary file to
420 enable code inspection. Disabled by default.
422 .. option:: --use-dummy-perf-counters
424 If set, llvm-exegesis will not read any real performance counters and
425 return a dummy value instead. This can be used to ensure a snippet doesn't
426 crash when hardware performance counters are unavailable and for
427 debugging :program:`llvm-exegesis` itself.
429 .. option:: --execution-mode=[inprocess,subprocess]
431 This option specifies what execution mode to use. The `inprocess` execution
432 mode is the default. The `subprocess` execution mode allows for additional
features such as memory annotations but is currently restricted to X86-64 on
Linux.
436 .. option:: --benchmark-repeat-count=<repeat-count>
438 This option enables specifying the number of times to repeat the measurement
439 when performing latency measurements. By default, llvm-exegesis will repeat
440 a latency measurement enough times to balance run-time and noise reduction.
445 :program:`llvm-exegesis` returns 0 on success. Otherwise, an error message is
printed to standard error, and the tool returns a nonzero value.
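
A calling script can rely on this convention to detect failures. The Python
sketch below wraps a command with `subprocess.run` and reports success based
on the exit status. The function name `run_tool` is hypothetical, and the demo
uses the Python interpreter as a stand-in so that it runs without
`llvm-exegesis` installed.

```python
import subprocess
import sys


def run_tool(cmd):
    """Run cmd and return (succeeded, stdout, stderr).

    A zero exit status means success; on failure the error message is
    expected on standard error.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stdout, result.stderr


if __name__ == "__main__":
    # Stand-in for a failing llvm-exegesis invocation.
    failing = [sys.executable, "-c",
               "import sys; sys.stderr.write('error: demo'); sys.exit(3)"]
    ok, out, err = run_tool(failing)
    if not ok:
        print("tool failed:", err)
```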