bolt/docs/OptimizingClang.md

# Optimizing Clang: A Practical Example of Applying BOLT

## Preface

*BOLT* (Binary Optimization and Layout Tool) is designed to improve application
performance by laying out code in a manner that helps the CPU better utilize its caching and
branch-prediction resources.

The most obvious candidates for BOLT optimizations
are programs that suffer from many instruction cache and iTLB misses, such as
large applications measuring hundreds of megabytes in size. However, medium-sized
programs can benefit too. Clang, one of the most popular open-source C/C++ compilers,
is a good example of the latter. Its code size can easily be on the order of tens of megabytes.
As we will see, the Clang binary suffers from many instruction cache
misses and can be significantly improved with BOLT, even on top of profile-guided and
link-time optimizations.

In this tutorial we will first build Clang with PGO and LTO, and then show how to
apply BOLT optimizations to make Clang up to 15% faster. We will also analyze where
the compile-time performance gains come from, and verify that the speed-ups
hold up when building other applications.

## Building Clang

The process of getting Clang sources and performing the build is very similar to the
one described at http://clang.llvm.org/get_started.html. For completeness, we provide the detailed steps
on how to obtain and build Clang in the [Bootstrapping Clang-7 with PGO and LTO](#bootstrapping-clang-7-with-pgo-and-lto) section.

The only difference from the standard Clang build is that we require the `-Wl,-q` flag to be present during
the final link. This option saves relocation metadata in the executable file, but does not affect
the generated code in any way.
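
One way to confirm that the final link kept its relocations is to look for a `.rela.text` section in the binary. The sketch below assumes an x86-64 ELF binary and that `readelf` from binutils is installed; `has_text_relocs` is a hypothetical helper name, not part of BOLT.

```bash
# has_text_relocs BINARY: succeed if BINARY still carries relocations for
# .text, which is what linking with -Wl,-q (--emit-relocs) is expected to leave.
# Assumption: an ELF target that uses RELA relocations, such as x86-64.
has_text_relocs() {
    readelf -S -W "$1" | grep -q '\.rela\.text'
}

# Example (the path is a placeholder):
#   has_text_relocs path/to/clang-7 && echo "relocations present"
```

If the check fails, the `-Wl,-q` flag most likely did not reach the final link step.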

## Optimizing Clang with BOLT

We will use the setup described in [Bootstrapping Clang-7 with PGO and LTO](#bootstrapping-clang-7-with-pgo-and-lto).
Adjust the steps accordingly if you skipped that section. We will also assume that `llvm-bolt` is present in your `$PATH`.

Before we can run BOLT optimizations, we need to collect a profile for Clang, and we will use
the Clang/LLVM sources for that.
Collecting an accurate profile requires running `perf` on hardware that
implements taken branch sampling (the `-b`/`-j` flag). For that reason, it may not be possible to
collect an accurate profile in a virtualized environment, e.g. in the cloud.
We do support regular sampling profiles, but the performance
improvements are expected to be more modest.

```bash
$ mkdir ${TOPLEV}/stage3
$ cd ${TOPLEV}/stage3
$ CPATH=${TOPLEV}/stage2-prof-use-lto/install/bin/
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
    -DLLVM_ENABLE_PROJECTS="clang" \
    -DLLVM_USE_LINKER=lld -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage3/install
$ perf record -e cycles:u -j any,u -- ninja clang
```

Once the last command is finished, it will create a `perf.data` file larger than 10 GiB.
We will first convert this profile into a more compact aggregated
form suitable to be consumed by BOLT:
```bash
$ perf2bolt $CPATH/clang-7 -p perf.data -o clang-7.fdata -w clang-7.yaml
```
Notice that we are passing `clang-7` to `perf2bolt`, which is the real binary that
`clang` and `clang++` symlink to. The next step will optimize Clang using
the generated profile:
```bash
$ llvm-bolt $CPATH/clang-7 -o $CPATH/clang-7.bolt -b clang-7.yaml \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions \
    -split-all-cold -dyno-stats -icf=1 -use-gnu-stack
```
The output will look similar to the one below:
```t
...
BOLT-INFO: enabling relocation mode
BOLT-INFO: 11415 functions out of 104526 simple functions (10.9%) have non-empty execution profile.
...
BOLT-INFO: ICF folded 29144 out of 105177 functions in 8 passes. 82 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 5466.69 KB of code space. Folded functions were called 2131985 times based on profile.
BOLT-INFO: basic block reordering modified layout of 7848 (10.32%) functions
...
           660155947 : executed forward branches (-2.3%)
            48252553 : taken forward branches (-57.2%)
           129897961 : executed backward branches (+13.8%)
            52389551 : taken backward branches (-19.5%)
            35650038 : executed unconditional branches (-33.2%)
           128338874 : all function calls (=)
            19010563 : indirect calls (=)
             9918250 : PLT calls (=)
          6113398840 : executed instructions (-0.6%)
          1519537463 : executed load instructions (=)
           943321306 : executed store instructions (=)
            20467109 : taken jump table branches (=)
           825703946 : total branches (-2.1%)
           136292142 : taken branches (-41.1%)
           689411804 : non-taken conditional branches (+12.6%)
           100642104 : taken conditional branches (-43.4%)
           790053908 : all conditional branches (=)
...
```
The statistics in the output are based on the LBR profile collected with `perf`, and since we were using
the `cycles` counter, their accuracy is affected. However, the relative improvement in
`taken conditional branches` is a good indication that BOLT was able to straighten out the code even after PGO.
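
When comparing several BOLT runs, it can help to pull a single statistic out of the log. The following is a minimal sketch that assumes the `count : name (change)` line layout shown in the output above; `dyno_stat` is a hypothetical helper, not a BOLT utility.

```bash
# dyno_stat NAME: read llvm-bolt output on stdin and print the count and the
# relative change reported for the dyno-stats entry called NAME.
dyno_stat() {
    awk -v name="$1" -F ' : ' '
        {
            label = $2
            sub(/ \([^()]*\)$/, "", label)   # drop the trailing "(...)" part
            if (label == name) {
                count = $1
                gsub(/ /, "", count)         # strip the alignment padding
                change = $2
                sub(/^.*\(/, "", change)     # keep only what is inside "(...)"
                sub(/\)$/, "", change)
                print count, change
            }
        }'
}

# Two lines copied from the output above serve as sample input.
dyno_stat "taken conditional branches" <<'EOF'
           136292142 : taken branches (-41.1%)
           100642104 : taken conditional branches (-43.4%)
EOF
# prints: 100642104 -43.4%
```

In practice the function would read a saved log, for example a file that captured the `llvm-bolt` output.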

## Measuring Compile-time Improvement

`clang-7.bolt` can be used as a replacement for *PGO+LTO* Clang:
```bash
$ mv $CPATH/clang-7 $CPATH/clang-7.org
$ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7
```
Doing a new build of Clang using the new binary shows a significant overall
build time reduction on a 48-core Haswell system:
```bash
$ ln -fs $CPATH/clang-7.org $CPATH/clang-7
$ ninja clean && /bin/time -f %e ninja clang -j48
202.72
$ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7
$ ninja clean && /bin/time -f %e ninja clang -j48
180.11
```
That's 22.61 seconds (or 12%) faster compared to the *PGO+LTO* build.
Notice that we are measuring an improvement of the total build time, which includes the time spent in the linker.
Compilation time improvements for individual files differ, and speedups over 15% are not uncommon.
If we run BOLT on a Clang binary compiled without *PGO+LTO* (in which case the build finishes in 253.32 seconds),
the gains we see are over 50 seconds (25%),
but, as expected, the result is still slower than the *PGO+LTO+BOLT* build.
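
The percentage quoted for a build-time improvement depends on the baseline: the time saved can be divided by the old time (a reduction) or the old time by the new time (a speedup). The sketch below computes both from two wall-clock times so that results from different runs can be compared the same way; `build_gain` is a hypothetical helper name.

```bash
# build_gain BEFORE AFTER: compare two wall-clock build times in seconds.
build_gain() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        saved = before - after
        printf "%.2f s saved (%.1f%% less time, %.1f%% speedup)\n",
               saved, 100 * saved / before, 100 * (before / after - 1)
    }'
}

build_gain 202.72 180.11   # PGO+LTO vs. PGO+LTO+BOLT, building Clang
# prints: 22.61 s saved (11.2% less time, 12.6% speedup)
```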

## Source of the Wins

We mentioned that Clang suffers from a considerable number of instruction cache misses. This can be measured with `perf`:
```bash
$ ln -fs $CPATH/clang-7.org $CPATH/clang-7
$ ninja clean && perf stat -e instructions,L1-icache-misses -- ninja clang -j48
  ...
   16,366,101,626,647      instructions
      359,996,216,537      L1-icache-misses
```
That's about 22 instruction cache misses per thousand instructions. As a rule of thumb, if the application
has over 10 misses per thousand instructions, it is a good indication that it will be improved by BOLT.
Now let's see how many misses are in the BOLTed binary:
```bash
$ ln -fs $CPATH/clang-7.bolt $CPATH/clang-7
$ ninja clean && perf stat -e instructions,L1-icache-misses -- ninja clang -j48
  ...
  16,319,818,488,769      instructions
     244,888,677,972      L1-icache-misses
```
The number of misses per thousand instructions went down from 22 to 15, significantly reducing
the number of stalls in the CPU front-end.
Notice how the number of executed instructions stayed roughly the same. That's because we didn't
run any optimizations beyond the ones affecting the code layout. Besides instruction cache misses,
BOLT also reduces branch mispredictions, iTLB misses, and misses in L2 and L3.
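
The misses-per-thousand-instructions figures follow directly from the two `perf stat` counters. A minimal sketch of the arithmetic, using the counter values reported above with the thousands separators removed; `mpki` is a hypothetical helper name:

```bash
# mpki MISSES INSTRUCTIONS: print cache misses per thousand instructions.
mpki() {
    awk -v misses="$1" -v instructions="$2" \
        'BEGIN { printf "%.1f\n", misses * 1000 / instructions }'
}

mpki 359996216537 16366101626647   # PGO+LTO Clang, prints: 22.0
mpki 244888677972 16319818488769   # PGO+LTO+BOLT Clang, prints: 15.0
```

A result above 10 would, by the rule of thumb given earlier, suggest that the binary is a good candidate for BOLT.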

## Using Clang for Other Applications

We have collected a profile for Clang using its own source code. Would it be enough to speed up
the compilation of other projects? We picked `mysqld`, an open-source database, to do the test.

On our 48-core Haswell system, the build finished in 136.06 seconds using the *PGO+LTO* Clang, and in 126.10 seconds using the *PGO+LTO+BOLT* Clang.
That's a noticeable improvement, but not as significant as the one we saw on Clang itself.
This is partially because the number of instruction cache misses per thousand instructions is slightly lower in this scenario: 19 vs. 22.
Another reason is that Clang is run with a different set of options while building `mysqld` compared
to the training run.

Different options exercise different code paths, and
if we trained without a specific option, we may have misplaced parts of the code responsible for handling it.
To test this theory, we collected another `perf` profile while building `mysqld`, and merged it with the existing profile
using the `merge-fdata` utility that comes with BOLT. Optimized with that profile, the *PGO+LTO+BOLT* Clang was able
to perform the `mysqld` build in 124.74 seconds, i.e. 11 seconds or 9% faster compared to the *PGO+LTO* Clang.
The merged profile didn't make the original Clang compilation slower either, while the number of profiled functions in Clang increased from 11,415 to 14,025.

Ideally, the profile run should be done with a superset of all commonly used options. However, most of the improvement is expected with just the basic set.
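
A minimal sketch of the merge step, assuming two `.fdata` profiles produced by separate `perf2bolt` runs and relying on `merge-fdata` writing the merged profile to standard output. The file names are hypothetical, and `MERGE_FDATA` is an illustrative variable for pointing the wrapper at the tool; it is not something BOLT itself reads.

```bash
# merge_profiles OUTPUT INPUT...: merge BOLT profiles into OUTPUT.
# MERGE_FDATA (hypothetical) overrides the tool location; the default is to
# find merge-fdata in $PATH.
merge_profiles() {
    out=$1
    shift
    "${MERGE_FDATA:-merge-fdata}" "$@" > "$out"
}

# Example (hypothetical file names):
#   merge_profiles merged.fdata clang-build.fdata mysqld-build.fdata
```

The merged file then takes the place of the single-run profile when invoking `llvm-bolt`.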

## Summary

In this tutorial we demonstrated how to use BOLT to improve the
performance of the Clang compiler. Similarly, BOLT could be used to improve the performance
of GCC, or any other application suffering from a high number of instruction
cache misses.

----
# Appendix

## Bootstrapping Clang-7 with PGO and LTO

Below we describe detailed steps to build Clang, and make it ready for BOLT
optimizations. If you already have the build setup, you can skip this section,
except for the last step that adds the `-Wl,-q` linker flag to the final build.

### Getting Clang-7 Sources

Set `$TOPLEV` to the directory of your preference where you would like to do
builds, e.g. `TOPLEV=~/clang-7/`. Then clone the `release/7.x`
branch of the LLVM monorepo:
```bash
$ mkdir ${TOPLEV}
$ cd ${TOPLEV}
$ git clone --branch=release/7.x https://github.com/llvm/llvm-project.git
```

### Building Stage 1 Compiler

Stage 1 will be the first build we are going to do, and we will be using the
default system compiler to build Clang. If your system lacks a compiler, use
your distribution package manager to install one that supports C++11. In this
example we are going to use GCC. In addition to the compiler, you will need the
`cmake` and `ninja` packages. Note that we disable the build of certain
compiler-rt components that are known to cause build issues at release/7.x.
```bash
$ mkdir ${TOPLEV}/stage1
$ cd ${TOPLEV}/stage1
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \
      -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_ASM_COMPILER=gcc \
      -DLLVM_ENABLE_PROJECTS="clang;lld" \
      -DLLVM_ENABLE_RUNTIMES="compiler-rt" \
      -DCOMPILER_RT_BUILD_SANITIZERS=OFF -DCOMPILER_RT_BUILD_XRAY=OFF \
      -DCOMPILER_RT_BUILD_LIBFUZZER=OFF \
      -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage1/install
$ ninja install
```

### Building Stage 2 Compiler With Instrumentation

Using the freshly-baked stage 1 Clang compiler, we are going to build Clang with
profile generation capabilities:
```bash
$ mkdir ${TOPLEV}/stage2-prof-gen
$ cd ${TOPLEV}/stage2-prof-gen
$ CPATH=${TOPLEV}/stage1/install/bin/
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    -DLLVM_USE_LINKER=lld -DLLVM_BUILD_INSTRUMENTED=ON \
    -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage2-prof-gen/install
$ ninja install
```

### Generating Profile for PGO

While there are many ways to obtain the profile data, we are going to use the
source code already at our disposal, i.e. we are going to collect the profile
while building Clang itself:
```bash
$ mkdir ${TOPLEV}/stage3-train
$ cd ${TOPLEV}/stage3-train
$ CPATH=${TOPLEV}/stage2-prof-gen/install/bin
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
    -DLLVM_ENABLE_PROJECTS="clang" \
    -DLLVM_USE_LINKER=lld -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage3-train/install
$ ninja clang
```
Once the build is completed, the profile files will be saved under
`${TOPLEV}/stage2-prof-gen/profiles`. They need to be merged before they can be
passed back into Clang:
```bash
$ cd ${TOPLEV}/stage2-prof-gen/profiles
$ ${TOPLEV}/stage1/install/bin/llvm-profdata merge -output=clang.profdata *
```

### Building Clang with PGO and LTO

Now the profile can be used to guide optimizations to produce better code for
our scenario, i.e. building Clang. We will also enable link-time optimizations
to allow cross-module inlining and other optimizations. Finally, we are going to
add one extra step that is useful for BOLT: a flag that instructs the linker to
preserve relocations in the output binary. Note that this flag does not affect
the generated code or data used at runtime; it only writes metadata to the file
on disk:
```bash
$ mkdir ${TOPLEV}/stage2-prof-use-lto
$ cd ${TOPLEV}/stage2-prof-use-lto
$ CPATH=${TOPLEV}/stage1/install/bin/
$ export LDFLAGS="-Wl,-q"
$ cmake -G Ninja ${TOPLEV}/llvm-project/llvm -DLLVM_TARGETS_TO_BUILD=X86 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=$CPATH/clang -DCMAKE_CXX_COMPILER=$CPATH/clang++ \
    -DLLVM_ENABLE_PROJECTS="clang;lld" \
    -DLLVM_ENABLE_LTO=Full \
    -DLLVM_PROFDATA_FILE=${TOPLEV}/stage2-prof-gen/profiles/clang.profdata \
    -DLLVM_USE_LINKER=lld \
    -DCMAKE_INSTALL_PREFIX=${TOPLEV}/stage2-prof-use-lto/install
$ ninja install
```
Now we have a Clang compiler that can build itself much faster. As we will see,
it builds other applications faster as well, and, with BOLT, the compile time
can be improved even further.