BOLT is a post-link optimizer developed to speed up large applications.
It achieves the improvements by optimizing the application’s code layout
based on an execution profile gathered by a sampling profiler, such as the
Linux ``perf`` tool. An overview of the ideas implemented in BOLT, along
with a discussion of its potential and current results, is available in the
`CGO’19 paper <https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/>`__.

Input Binary Requirements
-------------------------

BOLT operates on X86-64 and AArch64 ELF binaries. At a minimum, the
binaries should have an unstripped symbol table, and, to get maximum
performance gains, they should be linked with relocations
(``--emit-relocs`` or ``-q`` linker flag).

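For example, a minimal sketch of such a link step (``main.cpp`` and ``app``
are placeholder names; any compiler works as long as the flag reaches the
linker):

.. code-block:: bash

  # --emit-relocs (or its alias -q) makes the linker keep relocations
  # in the output binary, enabling maximum BOLT gains.
  clang++ -O2 main.cpp -o app -Wl,--emit-relocs
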
BOLT disassembles functions and reconstructs the control flow graph
(CFG) before it runs optimizations. Since this is a nontrivial task,
especially when indirect branches are present, we rely on certain
heuristics to accomplish it. These heuristics have been tested on code
generated with the Clang and GCC compilers. The main requirement for C/C++
code is not to rely on code layout properties, such as function pointer
deltas. Assembly code can be processed too. Requirements for it include
a clear separation of code and data, with data objects being placed into
data sections/segments. If indirect jumps are used for intra-function
control transfer (e.g., jump tables), the code patterns should match
those generated by Clang/GCC.

NOTE: BOLT is currently incompatible with the
``-freorder-blocks-and-partition`` compiler option. Since GCC8 enables
this option by default, you have to explicitly disable it by adding the
``-fno-reorder-blocks-and-partition`` flag if you are compiling with GCC.

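For example, a hypothetical GCC 8 invocation with the option explicitly
disabled (source and output names are placeholders):

.. code-block:: bash

  # GCC 8 enables -freorder-blocks-and-partition by default, so switch
  # it back off to produce BOLT-friendly code.
  g++ -O2 -fno-reorder-blocks-and-partition main.cpp -o app -Wl,--emit-relocs
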
NOTE2: DWARF v5 is the new debugging format generated by the latest LLVM
and GCC compilers. It offers several benefits over the previous DWARF
v4. Currently, support for v5 in BOLT is a work in progress. While
you will be able to optimize binaries produced by the latest compilers,
until the support is complete you will not be able to update the debug
info with ``-update-debug-sections``. To temporarily work around the
issue, we recommend compiling binaries with the ``-gdwarf-4`` option, which
forces DWARF v4 output.

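For example (placeholder file names), forcing DWARF v4 while keeping
optimization and debug info:

.. code-block:: bash

  # -gdwarf-4 pins the debug format to DWARF v4 so that
  # -update-debug-sections can be used later.
  clang++ -O2 -g -gdwarf-4 main.cpp -o app -Wl,--emit-relocs
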
PIE and .so support has been added recently. Please report bugs if you
encounter any issues.

Installation
------------

You can build and use the docker image containing BOLT using our `docker
file <utils/docker/Dockerfile>`__. Alternatively, you can build BOLT
manually using the steps below.

BOLT heavily uses LLVM libraries, and by design, it is built as one of the
LLVM tools. The build process is not much different from a regular LLVM
build. The following instructions assume that you are running under Linux.

Start by cloning the LLVM repo and building the ``bolt`` project:

.. code-block:: bash

  > git clone https://github.com/llvm/llvm-project.git
  > mkdir build && cd build
  > cmake -G Ninja ../llvm-project/llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_PROJECTS="bolt"
  > ninja bolt

``llvm-bolt`` will be available under ``bin/``. Add this directory to
your path to ensure the rest of the commands in this tutorial work.

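For example, assuming you are still in the ``build`` directory from the
previous step:

.. code-block:: bash

  # Makes llvm-bolt, perf2bolt, and merge-fdata from this build visible
  # to the rest of the commands in the tutorial.
  > export PATH="$PWD/bin:$PATH"
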
Optimizing BOLT’s Performance
-----------------------------

BOLT runs many internal passes in parallel. If you foresee heavy usage
of BOLT, you can improve the processing time by linking against one of the
memory allocation libraries with good support for concurrency. E.g., to
use ``jemalloc``:

.. code-block:: bash

  > sudo yum install jemalloc-devel
  > LD_PRELOAD=/usr/lib64/libjemalloc.so llvm-bolt ....

Or if you would rather use ``tcmalloc``:

.. code-block:: bash

  > sudo yum install gperftools-devel
  > LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so llvm-bolt ....

Usage
-----

For a complete practical guide to using BOLT, see `Optimizing Clang with
BOLT <docs/OptimizingClang.md>`__.

In order to allow BOLT to re-arrange functions (in addition to
re-arranging code within functions) in your program, it needs a little
help from the linker. Add ``--emit-relocs`` to the final link step of
your application. You can verify the presence of relocations by checking
for the ``.rela.text`` section in the binary. BOLT will also report if it
detects relocations while processing the binary.

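A quick way to verify (``app`` is a placeholder for your binary):

.. code-block:: bash

  # A .rela.text section in the headers confirms relocations were kept.
  $ readelf -S app | grep -F .rela.text
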
Step 1: Collect Profile
~~~~~~~~~~~~~~~~~~~~~~~

This step is different for different kinds of executables. If you can
invoke your program to run on a representative input from a command
line, then check the **For Applications** section below. If your program
typically runs as a server/service, then skip to the **For Services**
section.

The version of the ``perf`` command used for the following steps has to
support the ``-F brstack`` option. We recommend using ``perf`` version 4.5
or newer.

For Applications
^^^^^^^^^^^^^^^^

This assumes you can run your program from a command line with a typical
input. In this case, simply prepend the command line invocation with
``perf record``:

.. code-block:: bash

  $ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ...

For Services
^^^^^^^^^^^^

Once you get the service deployed and warmed up, it is time to collect
perf data with LBR (branch information). The exact perf command to use
will depend on the service. E.g., to collect the data for all processes
running on the server for the next 3 minutes, use:

.. code-block:: bash

  $ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180

Depending on the application, you may need more samples to be included
with your profile. It’s hard to tell upfront what would be a sweet spot
for your application. We recommend that the profile cover 1B instructions
as reported by the BOLT ``-dyno-stats`` option. If you need to increase the
number of samples in the profile, you can either run the ``sleep``
command for longer or use the ``-F<N>`` option with ``perf`` to increase
the sampling frequency.

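For illustration, both knobs combined (the frequency and duration here are
arbitrary values that should be tuned for your service):

.. code-block:: bash

  # Sample at roughly 2000 Hz for 10 minutes instead of the defaults above.
  $ perf record -e cycles:u -j any,u -a -F 2000 -o perf.data -- sleep 600
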
Note that for profile collection we recommend using cycle events and not
``BR_INST_RETIRED.*``. Empirically we found it to produce better
results.

If collecting a profile with branches is not possible, e.g.,
when you run on a VM or on hardware that does not support it, then you
can use only sample events, such as cycles. In this case, the quality of
the profile information would not be as good, and performance gains with
BOLT are expected to be lower.

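A sample-only collection could look like this (note the absence of ``-j``;
remember to pass ``-nl`` to ``perf2bolt`` in Step 2):

.. code-block:: bash

  # Cycles-only profile: no LBR/branch data is recorded.
  $ perf record -e cycles:u -a -o perf.data -- sleep 180
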
If ``perf record`` is not available to you, you may collect a profile by
first instrumenting the binary with BOLT and then running it:

.. code-block:: bash

  llvm-bolt <executable> -instrument -o <instrumented-executable>

After you run ``<instrumented-executable>`` with the desired workload, its
BOLT profile should be ready for you in ``/tmp/prof.fdata``, and you can
use it directly in **Step 3** below; the instrumented profile is already in
BOLT format, so the conversion in **Step 2** can be skipped.

Run BOLT with the ``-help`` option and check the category “BOLT
instrumentation options” for a quick reference on instrumentation knobs.

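As an example of a possible instrumentation session (the
``--instrumentation-file`` knob is one such option on recent BOLT versions;
verify it against your ``-help`` output, and treat the ``myapp`` names and
paths as placeholders):

.. code-block:: bash

  # Redirect the profile away from the default /tmp/prof.fdata, then
  # exercise the instrumented binary with a representative workload.
  llvm-bolt myapp -instrument --instrumentation-file=/data/myapp.fdata -o myapp.inst
  ./myapp.inst <typical args>
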
Step 2: Convert Profile to BOLT Format
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NOTE: you can skip this step and feed ``perf.data`` directly to BOLT
using the experimental ``-p perf.data`` option.

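If you go that route, the **Step 3** command below can consume the perf data
directly; for example:

.. code-block:: bash

  # Experimental: BOLT aggregates perf.data itself, so the perf2bolt
  # conversion step is skipped.
  $ llvm-bolt <executable> -o <executable>.bolt -p perf.data -reorder-blocks=ext-tsp -dyno-stats
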
For this step, you will need the ``perf.data`` file collected in the
previous step and a copy of the binary that was running. The binary has
to be either unstripped or have its symbol table intact (i.e.,
running ``strip -g`` is okay).

Make sure ``perf`` is in your ``PATH``, and execute ``perf2bolt``:

.. code-block:: bash

  $ perf2bolt -p perf.data -o perf.fdata <executable>

This command will aggregate branch data from ``perf.data`` and store it
in a format that is both more compact and more resilient to binary
modifications.

If the profile was collected without LBRs, you will need to add the ``-nl``
flag to the command line above.

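For the sample-only profile from **Step 1**, the conversion would look like:

.. code-block:: bash

  # -nl: the profile contains samples only, no LBR branch records.
  $ perf2bolt -p perf.data -o perf.fdata -nl <executable>
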
Step 3: Optimize with BOLT
~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you have ``perf.fdata`` ready, you can use it for optimizations
with BOLT. Assuming your environment is set up to include the right path,
execute ``llvm-bolt``:

.. code-block:: bash

  $ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -split-eh -dyno-stats

If you need updated debug info, add the ``-update-debug-sections``
option to the command above. The processing time will be slightly longer.

For a full list of options, see the ``-help``/``-help-hidden`` output.

The input binary for this step does not have to be a 100% match of the binary
used for profile collection in **Step 1**. This could happen when you
are doing active development, and the source code constantly changes,
yet you want to benefit from profile-guided optimizations. However,
since the binary is not precisely the same, the profile information
could become invalid or stale, and BOLT will report the number of
functions with a stale profile. The higher the number, the less
performance improvement should be expected. Thus, it is crucial to
update ``.fdata`` for release branches.

Suppose your application can run in different modes, and you can
generate multiple profiles for each one of them. To generate a single
binary that can benefit all modes (assuming the profiles don’t
contradict each other), you can use the ``merge-fdata`` tool:

.. code-block:: bash

  $ merge-fdata *.fdata > combined.fdata

Use ``combined.fdata`` for **Step 3** above to generate a universally
optimized binary.

License
-------

BOLT is licensed under the `Apache License v2.0 with LLVM
Exceptions <./LICENSE.TXT>`__.