# BOLT

BOLT is a post-link optimizer developed to speed up large applications.
It achieves its improvements by optimizing an application's code layout based on
an execution profile gathered by a sampling profiler, such as the Linux `perf` tool.
An overview of the ideas implemented in BOLT, along with a discussion of its
potential and current results, is available in the
[CGO'19 paper](https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/).

## Input Binary Requirements

BOLT operates on X86-64 and AArch64 ELF binaries. At a minimum, the binaries
should have an unstripped symbol table, and, to get maximum performance gains,
they should be linked with relocations (`--emit-relocs` or `-q` linker flag).

BOLT disassembles functions and reconstructs the control flow graph (CFG)
before it runs optimizations. Since this is a nontrivial task,
especially when indirect branches are present, we rely on certain heuristics
to accomplish it. These heuristics have been tested on code generated by the
Clang and GCC compilers. The main requirement for C/C++ code is not to rely
on code layout properties, such as function pointer deltas.
Assembly code can be processed too. Requirements for it include a clear
separation of code and data, with data objects being placed into data
sections/segments. If indirect jumps are used for intra-function control
transfer (e.g., jump tables), the code patterns should match those
generated by Clang/GCC.

NOTE: BOLT is currently incompatible with the `-freorder-blocks-and-partition`
compiler option. Since GCC 8 enables this option by default, you have to
explicitly disable it by adding the `-fno-reorder-blocks-and-partition` flag if
you are compiling with GCC 8 or above.

NOTE2: DWARF v5 is the new debugging format generated by the latest LLVM and GCC
compilers. It offers several benefits over the previous DWARF v4. Currently, the
support for v5 is a work in progress for BOLT. While you will be able to
optimize binaries produced by the latest compilers, until the support is
complete, you will not be able to update the debug info with
`-update-debug-sections`. To temporarily work around the issue, we recommend
compiling binaries with the `-gdwarf-4` option, which forces DWARF v4 output.

PIE and .so support has been added recently. Please report bugs if you
encounter any issues.

## Installation

### Docker Image

You can build and use the Docker image containing BOLT using our [Dockerfile](utils/docker/Dockerfile).
Alternatively, you can build BOLT manually using the steps below.

### Manual Build

BOLT heavily uses LLVM libraries, and by design, it is built as one of the LLVM
tools. The build process is not much different from a regular LLVM build.
The following instructions assume that you are running under Linux.

Start by cloning the LLVM repo:

```
> git clone https://github.com/llvm/llvm-project.git
> mkdir build
> cd build
> cmake -G Ninja ../llvm-project/llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_PROJECTS="bolt"
> ninja bolt
```

`llvm-bolt` will be available under `bin/`. Add this directory to your path to
ensure the rest of the commands in this tutorial work.

## Optimizing BOLT's Performance

BOLT runs many internal passes in parallel. If you foresee heavy usage of
BOLT, you can improve the processing time by using one of the memory
allocation libraries with good support for concurrency. E.g., to use jemalloc:

```
> sudo yum install jemalloc-devel
> LD_PRELOAD=/usr/lib64/libjemalloc.so llvm-bolt ....
```
Or, if you would rather use tcmalloc:
```
> sudo yum install gperftools-devel
> LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so llvm-bolt ....
```
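
Library locations differ between distributions, so a wrapper script may want to probe for an allocator before preloading it. This is a minimal sketch, assuming a hypothetical helper named `first_existing_lib`. The two paths in the usage comment are the ones from the commands above and may not match your system.

```shell
# Hypothetical helper: print the first candidate path that exists.
# Returns non-zero if none of the candidates is present.
first_existing_lib() {
    for lib in "$@"; do
        if [ -e "$lib" ]; then
            printf '%s\n' "$lib"
            return 0
        fi
    done
    return 1
}

# Example: preload whichever allocator is installed, if any.
# alloc=$(first_existing_lib /usr/lib64/libjemalloc.so \
#                            /usr/lib64/libtcmalloc_minimal.so) \
#     && LD_PRELOAD="$alloc" llvm-bolt ....
```

If neither library is found, the helper returns a failure status and the `&&` chain does not run `llvm-bolt` at all, so the caller decides whether to fall back to the default allocator.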

## Usage

For a complete practical guide to using BOLT, see [Optimizing Clang with BOLT](docs/OptimizingClang.md).

### Step 0

In order to allow BOLT to re-arrange functions (in addition to re-arranging
code within functions) in your program, it needs a little help from the linker.
Add `--emit-relocs` to the final link step of your application. You can verify
the presence of relocations by checking for the `.rela.text` section in the binary.
BOLT will also report if it detects relocations while processing the binary.
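
The `.rela.text` check can be scripted. Below is a minimal sketch, assuming a hypothetical filter named `has_rela_text` that reads a section listing on stdin, such as the output of `readelf -S -W`, and succeeds only if a section named exactly `.rela.text` is listed. `my_app` in the usage comment is a placeholder.

```shell
# Hypothetical filter: succeed if the section listing on stdin contains
# a field that is exactly ".rela.text".
has_rela_text() {
    awk '{ for (i = 1; i <= NF; i++) if ($i == ".rela.text") found = 1 }
         END { exit found ? 0 : 1 }'
}

# Example:
# readelf -S -W ./my_app | has_rela_text && echo "relocations present"
```

Comparing whole fields, rather than searching for a substring, avoids a false positive on section names that merely begin with `.rela.text`.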

### Step 1: Collect Profile

This step is different for different kinds of executables. If you can invoke
your program to run on a representative input from a command line, then see the
**For Applications** section below. If your program typically runs as a
server/service, then skip to the **For Services** section.

The version of the `perf` command used for the following steps has to support
the `-F brstack` option. We recommend using `perf` version 4.5 or later.

#### For Applications

This assumes you can run your program from a command line with a typical input.
In this case, simply prefix the command line invocation with `perf record`:
```
$ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ...
```

#### For Services

Once you get the service deployed and warmed up, it is time to collect perf
data with LBR (branch information). The exact perf command to use will depend
on the service. E.g., to collect the data for all processes running on the
server for the next 3 minutes, use:
```
$ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180
```

Depending on the application, you may need more samples to be included with
your profile. It's hard to tell upfront what would be a sweet spot for your
application. We recommend that the profile cover 1B instructions as reported
by BOLT's `-dyno-stats` option. If you need to increase the number of samples
in the profile, you can either run the `sleep` command for longer or use the
`-F<N>` option with `perf` to increase the sampling frequency.
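
The application and service invocations differ only in the `-a` flag, the optional `-F` frequency, and the profiled command. The sketch below assumes a hypothetical dry-run helper named `perf_record_cmd` that prints the command instead of running it. The frequency `5000` and the program `./my_app` in the usage comment are arbitrary illustrations, not recommendations.

```shell
# Hypothetical dry-run helper: print a `perf record` command line.
#   $1: "app" (profile one command) or "service" (system-wide, adds -a)
#   $2: sampling frequency for -F, or "" to keep perf's default
#   remaining arguments: the command to run under perf
perf_record_cmd() {
    mode=$1
    freq=$2
    shift 2
    cmd="perf record -e cycles:u -j any,u"
    if [ "$mode" = "service" ]; then
        cmd="$cmd -a"
    fi
    if [ -n "$freq" ]; then
        cmd="$cmd -F $freq"
    fi
    printf '%s -o perf.data -- %s\n' "$cmd" "$*"
}

# Examples:
# perf_record_cmd service "" sleep 180
# perf_record_cmd app 5000 ./my_app --input data
```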

Note that for profile collection we recommend using cycle events and not
`BR_INST_RETIRED.*`. Empirically, we found cycle events to produce better results.

If the collection of a profile with branches is not available, e.g., when you run on
a VM or on hardware that does not support it, then you can use only sample
events, such as cycles. In this case, the quality of the profile information
would not be as good, and performance gains with BOLT are expected to be lower.

#### With Instrumentation

If `perf record` is not available to you, you may collect a profile by first
instrumenting the binary with BOLT and then running it.
```
$ llvm-bolt <executable> -instrument -o <instrumented-executable>
```

After you run `<instrumented-executable>` with the desired workload, its BOLT
profile should be ready for you in `/tmp/prof.fdata` and you can skip
**Step 2**.

Run BOLT with the `-help` option and check the category "BOLT instrumentation
options" for a quick reference on instrumentation knobs.
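
Put together, the instrumentation route is a three-command sequence. The sketch below assumes a hypothetical dry-run helper named `instrumentation_plan` that only prints the commands. The `.instrumented` suffix and `./my_app` are placeholders, and the final command omits the optimization options described in **Step 3**.

```shell
# Hypothetical dry-run helper: print the instrumentation workflow.
#   $1: path to the original executable
#   remaining arguments: workload arguments for the instrumented run
instrumentation_plan() {
    exe=$1
    shift
    # 1. Produce an instrumented copy of the binary.
    printf 'llvm-bolt %s -instrument -o %s.instrumented\n' "$exe" "$exe"
    # 2. Run the instrumented copy with the workload arguments, if any.
    set -- "$exe.instrumented" "$@"
    printf '%s\n' "$*"
    # 3. Optimize the original binary with the collected profile.
    printf 'llvm-bolt %s -o %s.bolt -data=/tmp/prof.fdata\n' "$exe" "$exe"
}

# Example:
# instrumentation_plan ./my_app --input data
```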

### Step 2: Convert Profile to BOLT Format

NOTE: you can skip this step and feed `perf.data` directly to BOLT using the
experimental `-p perf.data` option.

For this step, you will need the `perf.data` file collected in the previous step and
a copy of the binary that was running. The binary has to be either
unstripped or have its symbol table intact (i.e., running `strip -g` is
okay).

Make sure `perf` is in your `PATH`, and execute `perf2bolt`:
```
$ perf2bolt -p perf.data -o perf.fdata <executable>
```

This command will aggregate branch data from `perf.data` and store it in a
format that is both more compact and more resilient to binary modifications.

If the profile was collected without LBRs, you will need to add the `-nl` flag to
the command line above.
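
For the non-LBR path, collection and conversion have to agree. The sketch below assumes a hypothetical dry-run helper named `no_lbr_profile_cmds` that prints both commands. Dropping `-j any,u` from `perf record` is one reading of "only sample events, such as cycles" from Step 1, not a command taken from this guide, and `./my_app` is a placeholder.

```shell
# Hypothetical dry-run helper: print the commands for a profile
# collected without LBR.
#   $1: path to the executable
#   remaining arguments: arguments for the profiled run
no_lbr_profile_cmds() {
    exe=$1
    # "$*" includes the executable as its first word.
    printf 'perf record -e cycles:u -o perf.data -- %s\n' "$*"
    # -nl tells perf2bolt that the profile has no LBR data.
    printf 'perf2bolt -p perf.data -o perf.fdata -nl %s\n' "$exe"
}

# Example:
# no_lbr_profile_cmds ./my_app --input data
```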

### Step 3: Optimize with BOLT

Once you have `perf.fdata` ready, you can use it for optimizations with
BOLT. Assuming your environment is set up to include the right path, execute
`llvm-bolt`:
```
$ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -split-eh -dyno-stats
```

If you need updated debug info, add the `-update-debug-sections` option
to the command above. The processing time will be slightly longer.
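
Because the option list is long, a build script may prefer to assemble it in one place. This is a minimal sketch, assuming a hypothetical dry-run helper named `bolt_optimize_cmd` that prints the command shown above and appends `-update-debug-sections` on request. `./my_app` is a placeholder.

```shell
# Hypothetical dry-run helper: print the llvm-bolt optimization command.
#   $1: input executable
#   $2: profile in BOLT format (.fdata)
#   $3: "debug" to also update debug info (optional)
bolt_optimize_cmd() {
    cmd="llvm-bolt $1 -o $1.bolt -data=$2"
    cmd="$cmd -reorder-blocks=ext-tsp -reorder-functions=hfsort"
    cmd="$cmd -split-functions -split-all-cold -split-eh -dyno-stats"
    if [ "${3:-}" = "debug" ]; then
        cmd="$cmd -update-debug-sections"
    fi
    printf '%s\n' "$cmd"
}

# Example:
# bolt_optimize_cmd ./my_app perf.fdata debug
```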

For a full list of options, see the `-help`/`-help-hidden` output.

The input binary for this step does not have to 100% match the binary used for
profile collection in **Step 1**. This could happen when you are doing active
development, and the source code constantly changes, yet you want to benefit
from profile-guided optimizations. However, since the binary is not precisely the
same, the profile information could become invalid or stale, and BOLT will
report the number of functions with a stale profile. The higher the
number, the less performance improvement should be expected. Thus, it is
crucial to update `.fdata` for release branches.

## Multiple Profiles

Suppose your application can run in different modes, and you can generate
multiple profiles, one for each mode. To generate a single binary that can
benefit all modes (assuming the profiles don't contradict each other), you can
use the `merge-fdata` tool:
```
$ merge-fdata *.fdata > combined.fdata
```
Use `combined.fdata` for **Step 3** above to generate a universally optimized
binary.
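
One pitfall with the glob form: if `combined.fdata` is left over from a previous run in the same directory, `*.fdata` also matches it, and the redirection has already truncated it by the time `merge-fdata` starts. The sketch below assumes a hypothetical wrapper named `merge_profiles` that refuses such a call. The `MERGE_FDATA` variable is also hypothetical and only exists so that the tool can be swapped out for testing.

```shell
# Hypothetical wrapper around merge-fdata.
#   $1: output file
#   remaining arguments: input profiles
merge_profiles() {
    if [ "$#" -lt 2 ]; then
        echo "usage: merge_profiles <output> <input>..." >&2
        return 1
    fi
    out=$1
    shift
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "merge_profiles: no such profile: $f" >&2
            return 1
        fi
        # Refuse to read from the file we are about to overwrite.
        if [ "$f" = "$out" ]; then
            echo "merge_profiles: output is also an input: $f" >&2
            return 1
        fi
    done
    "${MERGE_FDATA:-merge-fdata}" "$@" > "$out"
}

# Example:
# merge_profiles combined.fdata mode1.fdata mode2.fdata
```

The comparison is by path string only, so the same file reached through two different paths is not detected.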

## License

BOLT is licensed under the [Apache License v2.0 with LLVM Exceptions](./LICENSE.TXT).